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NEARLY UNBIASED VARIABLE SELECTION UNDER MINIMAX 

CONCAVE PENALTY 

By Cun-Hui Zhang 1 

Rutgers University 

We propose MC+, a fast, continuous, nearly unbiased and accu- 
rate method of penalized variable selection in high-dimensional linear 
regression. The LASSO is fast and continuous, but biased. The bias 
of the LASSO may prevent consistent variable selection. Subset se- 
lection is unbiased but computationally costly. The MC+ has two 
elements: a minimax concave penalty (MCP) and a penalized linear 
unbiased selection (PLUS) algorithm. The MCP provides the con- 
vexity of the penalized loss in sparse regions to the greatest extent 
given certain thresholds for variable selection and unbiasedness. The 
PLUS computes multiple exact local minimizers of a possibly noncon- 
vex penalized loss function in a certain main branch of the graph of 
critical points of the penalized loss. Its output is a continuous piece- 
wise linear path encompassing from the origin for infinite penalty to 
a least squares solution for zero penalty. We prove that at a universal 
penalty level, the MC+ has high probability of matching the signs 
of the unknowns, and thus correct selection, without assuming the 
strong irrepresentable condition required by the LASSO. This selec- 
tion consistency applies to the case of p 3> n, and is proved to hold 
for exactly the MC+ solution among possibly many local minimiz- 
ers. We prove that the MC+ attains certain minimax convergence 
rates in probability for the estimation of regression coefficients in l r 
balls. We use the SURE method to derive degrees of freedom and C p - 
type risk estimates for general penalized LSE, including the LASSO 
and MC+ estimators, and prove their unbiasedness. Based on the 
estimated degrees of freedom, we propose an estimator of the noise 
level for proper choice of the penalty level. For full rank designs and 
general sub-quadratic penalties, we provide necessary and sufficient 
conditions for the continuity of the penalized LSE. Simulation results 
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overwhelmingly support our claim of superior variable selection prop- 
erties and demonstrate the computational efficiency of the proposed 
method. 

1. Introduction. Variable selection is fundamental in statistical analysis 
of high-dimensional data. With a proper selection method and under suitable 
conditions, we are able to build consistent models which are easy to inter- 
pret, to avoid over fitting in prediction and estimation, and to identify rele- 
vant variables for applications or further study. Consider a linear model in 
which a response vector y € M. n depends on p predictors Xj 6 M n , j = 1, . . . ,p, 
through a linear combination Y2j= l fij x j- For small p, subset selection meth- 
ods can be used to find a good guess of the pattern {j : f3j ^ 0}. For example, 
one may impose a proper penalty on the number of selected variables based 
on the AIC [Akaike (1973)], C p [Mallows (1973)], BIC [Schwarz (1978)], 
RIC [Foster and George (1994)] or a data driven method. For large p, sub- 
set selection is not computationally feasible, so that continuous penalized or 
gradient threshold methods are typically used. 

Let || • || be the Euclidean norm. Consider a penalized squared loss 

v 

(1.1) L(b; A) = (2n)^ 1 ||y - Xb|| 2 + J>(|M; A) 

with a penalty p(t; A) indexed by A > 0, in the linear regression model 

p 

(1.2) y = 2^ / 8 7 -x J -+e, 

j'=i 

where X = (xi, . . . ,x p ), /3 = (/?i, . . . , f3 p )' and e ~ N(0, a 2 l n ). Assume the 
penalty p(t; A) is nondecreasing in t and has a continuous derivative p(t; A) = 
(d j dt) p(t; A) in (0,oo). Assume further p(0+; A) > 0, so that minimizers of 
(1.1) have variable selection features with zero components [Donoho et al. 
(1992)]. Changing the index A if necessary, we assume p(0+; A) = A whenever 
p(0+; A) < oo, so that A has the interpretation as the threshold level for the 
individual coefficients flj under the standardization ||xj|| 2 = n. 

A widely used penalized lease squares estimator (LSE) is the LASSO 
[Tibshirani (1996)] or equivalently Basis Pursuit [Chen and Donoho (1994)], 
with p(t;\) = X\t\. The LASSO is easy to compute [Osborne, Presnell and 
Turlach (2000a, 2000b) and Efron et al. (2004)] and has the interpretation 
as boosting [Schapire (1990), Freund and Schapire (1996) and Friedman, 
Hastie and Tibshirani (2000)]. Throughout the paper, let 

(1.3) A° = {j:^^0} and d° = \A°\ = #{j : Pj + 0} 

unless otherwise stated. Under a strong irrepresentable condition on the 
normalized Gram matrix X! = X'X/n, Meinshausen and Buhlmann (2006), 
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Tropp (2006), Zhao and Yu (2006) and Wainwright (2006) proved that the 
LASSO is variable selection consistent 

(1.4) P{A = A°}^1 withA = {j:%^0}, 

provided that min^.^o \ 1S greater than the £oo — > norm of the in- 
verse of a diagonal sub-matrix of 5] of rank d°, among other regularity con- 
ditions on {\,n,p,e} . However, the strong irrepresentable condition, which 
essentially requires the £oo — > norm of a (p — d°) x d° matrix of 5] to be 
uniformly less than 1, is quite restrictive for moderately large d°, and that 
due to the estimation bias, the condition is nearly necessary for the LASSO 
to be selection consistent. Here the bias of a penalized LSE is treated as its 
estimation error when e = 0. Under a relatively mild sparse Riesz condition 
on the £2 — > £2 norm of sub-Gram matrices and their inverses up to a cer- 
tain rank, Zhang and Huang (2008) proved that the dimension \A\ for the 
LASSO selection is of the same order as d° and that the LASSO selects all 
variables with \f3j\ above a certain quantity of the order ^fd?\. These results 
are still unsatisfactory in view of the possibility of incorrect selection and the 
extra factor \fdP with the condition on the order of \ f3j \ for correct selection, 
compared with the threshold level A. Again, due to the estimation bias of 
the LASSO, the extra factor \[d° cannot be completely removed under the 
sparse Riesz condition. From these points of view, the bias of the LASSO 
severely interferes with variable selection when p and d° are both large. 

Prior to the above mentioned studies about the interference of the bias 
of the LASSO with accurate variable selection or conditions to limit such 
interference, Fan and Li (2001) raised the concern of the effect of the bias of 
more general penalized estimators on estimation efficiency They pointed 
out that the bias of penalized estimators can be removed almost com- 
pletely by choosing a constant penalty beyond a second threshold level 7A, 
and carefully developed the SCAD method [Fan (1997)] with the penalty 
A J min{l, (7 — x/A) + /(7 — l)}dx, 7 > 2. Iterative algorithms were devel- 
oped there and in Hunter and Li (2005) and Zou and Li (2008) to approx- 
imate a local minimizer of the SCAD penalized loss for fixed (A, 7). For 
penalized methods with unbiasedness and selection features, Fan and Peng 
(2004) proved the existence, variable selection consistency (1.4) and asymp- 
totic estimation efficiency of some local minimizer of the penalized loss under 
the dimensionality constraint p = o(n r ) with r = 1/3, 1/4 or 1/5 depending 
on regularity conditions. Their results apply to general classes of loss and 
penalty functions but do not address the uniqueness of the solution or pro- 
vide methodologies for finding or approximating the local minimizer with 
the stated properties, among potentially many local minimizers. A major 
cause of computational and analytical difficulties in these studies of nearly 
unbiased selection methods is the nonconvexity of the minimization problem. 
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A number of recent papers have considered LASSO-like or LASSO-based 
convex minimization procedures. Candes and Tao (2007) proposed a Dantzig 
selector and provided elegant probabilistic upper bounds for the £2 loss for 
the estimation of (3. However, while the Dantzig selector and LASSO have 
been found to perform similarly in simulation studies [Efron, Hastie and 
Tibshirani (2007), Meinshausen, Rocha and Yu (2007) and Candes and Tao 
(2007), page 2401], little is known about the selection consistency of the 
Dantzig selector. Multiple-stage methods either share certain disadvantages 
of the LASSO for variable selection or require additional nontechnical side 
conditions, compared with our results. Current theory on such procedures 
has been focused on fixed p or d°, while the most interesting case is p 3> 
n > d° — > 00. Post-LASSO selection [Meinshausen (2007)] or bootstrapped 
LASSO [Bach (2008)] may not recover false nondiscovery of the LASSO 
(Section 6.5). Adaptive LASSO [Zou (2006), Huang, Ma and Zhang (2008) 
and Zou and Li (2008)] requires an initial estimator of (3 based on which 
small penalty levels could be assigned to most j3j 7^ and large penalty levels 
to most f3j = 0. The nonnegative garrotte estimator [Yuan and Lin (2007)] 
requires an initial estimator to be within o(A) from (3. For p> n, correlation 
screening [Fan and Lv (2008)] requires A° to be a subset of the indices of 
the m largest values of |x^y|/||xj || with a certain m<n. 

The main purpose of this paper is to propose and study an MC+ method- 
ology. The MC+ provides a fast algorithm for nearly unbiased concave pe- 
nalized selection in the linear model (1.2). The selection consistency (1.4) 
hol ds for the c omputed MC+ solution at the universal penalty level A un i v = 
o^J (2/n) logp [Donoho and Johnston (1994b)], without assuming the strong 
irrepresentable condition or requiring min^^o|/3j|/A un i v to be greater than a 

quantity of the order \fdP or the — > norm of a matrix of rank d° . This 
selection consistency holds up to dimension d° <d*, including the case of 
p ^> n > d° — > 00, and this upper bound d*, determined by the sparse Riesz 
condition on X, could be as large as n/\og(p/n). We further prove that the 
£ q loss of the MC+ attains minimax convergence rates in probability for the 
estimation of (3 in t r balls with 0<r<lAg<2. We also consider a general 
theory of penalized LSE, including the continuity of estimators, unbiased 
estimation of risk, and the estimation of noise level, in addition to variable 
selection and the estimation of (3. This paper is written based on Zhang 
(2007b), an April, 2007 Rutgers University Technical Report containing all 
the results in Sections 3, 4 and 5 with more extensive discussion of the PLUS 
algorithm and less explicit constants in the selection consistency theorems. 
A brief description of Zhang (2007b) can be found in Zhang (2008), which 
contains some additional simulation results. 

2. A sketch of main results. We provide a brief description of the MC+ 
method and our main results, along with some crucial concepts, conditions 
and necessary notation. 
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2.1. The MC+. The MC+ has two components: a minimax concave 
penalty (MCP) and a penalized linear unbiased selection (PLUS) algorithm. 
The MCP is defined as 

(2.1) p(t;\) = \ [ (l-x/( 7 X)) + dx 

Jo 

with a regularization parameter 7 > 0. It minimizes the maximum concavity 

(2.2) k( P ) = k(p;X)= sup {p(h; A) - p(t 2 ; A)}/(* 2 - *i) 

0<ti<t 2 

subject to the following unbiasedness and selection features: 

(2.3) p(t;X) = Vt>7A, p(0+;A) = A. 

For AC. {1, ... ,p}, define sub-design and sub-Gram matrices 

(2.4) X A = e A) nx \ A \, 'S AtB = X. ; A X B /n, Y: a = 'S a ,a- 

Let d* be a positive integer. The penalized loss (1.1) is sparse convex with 
rank d* if it is convex in all models {b : bj = Vj ^ A} with \A\ < d* . This 
sparse convexity condition holds if the convexity of the squared loss ||y — 
Xb|| 2 /(2n) overcomes the concavity of the penalty in all such sparse models 
with I .A I < d* , or equivalently 

(2.5) k(p; A) < min c min (XU) where cw^a) = min ||Sau||. 

|A|<d* l|u||=l 

Although the unbiasedness and selection features (2.3) preclude convex 
penalties, the MCP provides the sparse convexity to the broadest extent 
by minimizing the maximum concavity (2.2). This is a natural motivation 
for the MCP. The MCP achieves k(p;X) = 1/7. A larger value of its reg- 
ularization parameter 7 affords less unbiasedness and more concavity. For 
each penalty level A, the MCP provides a continuum of penalties with the 
t\ penalty at 7 = 00 and the u £q penalty" as 7 — > 0+. 

Given a penalty p{-\ •), A © (3 G ]R 1+P is a critical point of the penalized 
loss in (1.1) if 



(2.6) 




XP)/n = Bg D .(P j )p(\P j \;\), (3j^0, 
-X/3)/n|<A, S = 0, 



where sgn(t) = I{t > 0} — I{t < 0}. For convex penalized loss, (2.6) is the 
Karush-Kuhn-Tucker (KKT) condition for the global minimization of (1.1). 
In general, solutions of (2.6) include all local minimizers of L(-; A) for all A. 
The graph of the solutions of (2.6) is studied in Section 3. Consider 

u ^.(x) f a continuous path of solutions of (2.6) in 
(2.7) X y ' © p = < ^(0) , . 

I with (3 = and lim^^oo X^ = 0. 
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For the MCP, we prove in Section 3.3 that almost everywhere in (X, y), a 
path (2.7) uniquely exists up to continuous transformations of x from [0, oo) 

onto [0, oo) and that f3 ends at a point of global least squares fit as x — > oo. 
Thus, in the graph of the solutions of (2.6), (2.7) provides a unique branch 
encompassing from the origin (3 = to an optimal fit. We call (2.7) the main 
branch of the solution graph. For concave penalties, solutions of (2.6) may 
form additional branches as loops not connected to the origin (Figure 3). In 
the PLUS algorithm, the integer part of x in (2.7) represents the number of 
computational steps and the fraction part represents the linear interpolation 
between steps as in (3.8). 

Given a penalty level A, we propose as a variable selector and an estimator 
of /3 

(2.8) 3(A)=3 (:CA) in (2.7) with x A = inf{x > : A^ < A}, 

or equivalently the solution when the penalty level A is first reached in the 
path. The estimator (2.8) and the global minimum of (1.1) may not be the 
same for nonconvex penalized loss. Still, the uniqueness of (2.7) implies that 
/3(A) is uniquely defined, including the case of p > n. We call (2.8) the MC+ 
if the MCP (2.1) is used in (2.6) and thus (2.7). 

The PLUS algorithm computes the main branch (2.7) of the solution 
graph of (2.6) for quadratic spline penalty functions of the form p(t; A) = 
\ 2 p(t/X). The PLUS is described in detail and studied in Section 3. For the 
quadratic spline penalties, the graph of solutions of (2.6) is piecewise linear 
and so is its main branch (2.7). The PLUS differs from existing nonconvex 
minimization algorithms in three important aspects: (i) it computes the 
exact value of local minimizers instead of iteratively approximating them; 
(ii) it computes a path of possibly multiple solutions for the entire range 
of the penalty level A > instead of a single solution for a fixed A; (hi) it 
computes multiple local minimizers for individual A by tracking along its 
path of solutions for different values of A instead of trying to jump from the 
domain of attraction of one solution to another for a fixed A. In each step, 
the PLUS computes one line segment in its path between two turning points, 
and its computational cost is the same as the LARS [Efron et al. (2004)] per 
step. The MC+ with larger rcgularization parameter 7 provides smoother 
estimators and computationally less complex path, but larger bias and less 
accurate variable selection. The MC+ path converges to the LASSO path 
as 7 — > 00. 

2.2. Some simulation results and heuristics. The proposed MC+ pro- 
vides fast, continuous, nearly unbiased and accurate variable selection in 
high-dimensional linear regression, as our theoretical and numerical results 
support. 
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Table 1 

Performance of LASSO, MC+ and SCAD+ in experiment 1. 100 replications, n = 300, 
p = 200, 0,/<r = l/2, 7 = 2/(1 -max^fe|x^x fe |/n), 7 = 2.69, CS = I{A = A°}, 
SEf} = ||3-/3|| 2 , k = #(steps), log(<7A/(5A univ )) = integer/20 



d° = 10 d° = 20 d° = 40 



A/ct LASSO MC+ SCAD+ LASSO MC+ SCAD+ LASSO MC+ SCAD+ 



A/5 


CS 


0.45 


0.77 


0.71 


0.09 


0.87 


0.62 


0.00 


0.81 


0.27 


— Auniv/ & 


SE/3 


0.340 


0.063 


0.131 


0.831 


0.160 


0.480 


2.097 


0.452 


1.842 


= 0.188 


fc 


12 


16 


26 


23 


31 


50 


47 


63 


127 


Fixed A/5 


A/5 


0.266 


0.248 


0.248 


0.257 


0.231 


0.195 


0.231 


0.195 


0.169 


for maxCS 


CS 


0.88 


0.98 


0.92 


0.44 


0.97 


0.70 


0.01 


0.83 


0.45 




k 


11 


11 


17 


21 


23 


47 


44 


60 


149 


Fixed A/a 


A/5 


0.076 


0.153 


0.138 


0.060 


0.138 


0.124 


0.042 


0.138 


0.120 


forminSEp SE^ 


0.154 


0.043 


0.041 


0.287 


0.082 


0.080 


0.502 


0.167 


0.161 




A: 


41 


22 


34 


65 


43 


67 


102 


84 


169 



Table 1 presents the results of experiment 1 of our simulation study to 
demonstrate the superior selection accuracy and competitive computational 
complexity of the MC+, compared with the LASSO and SCAD. Since there 
are quite a few different ways of (approximately) computing possibly dif- 
ferent SCAD local minimizers, we denote by SCAD+ the PLUS solution 
of the SCAD. We measure the selection accuracy by the proportion CS of 
replications with the correct selection CS = I {A = A°}, and the computa- 
tional complexity by the average k of the number of the PLUS steps. In this 
experiment, (n,p) = (300, 200), y is generated with /3j = ±/3* for j E A° and 
e ~ ^"(0,1,,,) in (1.2), and Xj are generated by greedy sequential sampling 
(Section 6.1) of groups of 10 most correlated vectors from a pool of 600 vec- 
tors from the sphere {x: ||x|| = y/n}. The design X, A°, the signs of (3 and 
£ are drawn independently for the 100 replications with d° G {10,20,40}. 
The a 2 is the residual mean squares with 100 degrees of freedom in the full 
200-dimensional model. 

The dimensions (n,p,d°) in experiment 1 are moderate, but larger than 
those in some recent simulation studies of other nonconvex minimization 
algorithms. This modest setting allows us to demonstrate the significance 
of the impact of d° = #{j:/5j ^ 0}, and thus the estimation bias, on the 
selection consistency of the LASSO in the absence of difficulties involving 
ultrahigh dimensionality or the singularity with rank(X) < p. More simula- 
tion results are presented in Section 6 with (n,p) = (300,2000), (600,3000), 
(100,2000) and (200,10,000) to demonstrate the scalability of the PLUS 
algorithm, among other issues. 

Why is the MC+ able to avoid both the interference of estimation bias 
with variable selection and the computational difficulties with nonconvex 



8 



C.-H. ZHANG 



minimization? A short, heuristic explanation is that for standardized ||xj|| = 
y/n and a carefully chosen 7 > 1, the condition 

(2.9) / a» = min{|# 7 -|:j€A°}>7A with A > A univ = ay / (2/n) bgp, 

and the sparsity of (3 are allowed to match the extent of the unbiasedness 
and sparse convexity of the MC+. The lower bound for /3* in (2.9) allows 
unbiased selection of all j £ A°, while the lower bound for A prevents selec- 
tion of variables outside A° given the selection of all variables in A° . Thus, 
(2.9) guarantees with large probability that the LSE 



with the oracular choice B = A°, is one of the local minimizers of the pe- 
nalized loss. Meanwhile, the sparse convexity (2.5) provides the uniqueness 
among sparse solutions of (2.6) and controls the computational complexity 
of the MC+. 

This argument does not work with the LASSO due to the estimation bias. 
Let /3 be the t\ oracle with (3 Ao (\) = (3 A o — X~S A l sgn(/3^ D ) and /3»(A) = 
for j $ A°. By the KKT condition, sgn(3(A)) = sgn(/3) for the LASSO if 
and only if (iff) |x'-(y — X/3 (A))|/n < A and sgn(/3 (A)) = sgn(/3). However, 
j3 (A) is biased with E(3° Ao (\) — (3 A o = —\Yl A l sgn(/3 A o) 7^ 0. 

2.3. Selection consistency. We study the selection consistency of the pe- 
nalized LSE under the sparse Riesz condition (SRC) on X: for suitable 
< c* < c* < 00 and rank d* , 

(2.11) c* < min c min (T, A ) < max c max (E A ) < c* , 

|A|<d* |A|<d* 

where is as in (2.4) and c max (M) is the largest eigenvalue of M. Con- 
ditions on X and (3 must be configured to accommodate each other in our 
theorems. In this subsection, we study selection consistency for d° < d* = 
d* /(c* /c* + 1/2). In the next subsection, we study estimation by comparing (3 
and the oracle estimator (2.10) with \B\ < d*. Section 4 covers more general 
configurations. Although {d* , c* , c*} are all allowed to depend on n, the SRC 
is easier to understand with fixed {c*,c*} and large d* = d*, which asserts 
the equivalence of the norms HXbH/y^ and ||b|| up to : bj 7^ 0} = d* . 
Define p t = Pp,^^ by 



for nonnegative integers m G [1, j?— d°] and reals e E (0, 1]. Note that 21ogp e > 
1. 



(2.10) 



3° = argmin{||y - Xb|| 2 : bj = Vj i B) 



b 



(2.12) 
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Theorem 1. Suppose (1.2) holds with ||xj|| 2 = n. Let A°, d° and A be 

as in (1.3) and (14) and 3° be as in (2.10) with B = A° . Suppose (2 .11) 
holds and d° < d* = d*/(c*/c* + l/2). Let \ 1>e = a^/(2/n) log((p - d°)/e) and 
^2,€ > max{2\/c*o"^/ (2/n) \ogp e , \i, e }> where e 6 (0, 1] is fixed and p e is de- 
fined with m = d* — d° . Let w° be the largest diagonal element of S^o. Let 

(3 = /3(A) with a deterministic or random A, where /3(A) is the MC+ selector 
(2.8) with 7 > c~ l ^jA + c^/c* . Then 

(2.13) P0 ^ 3° or sgn(3) + sgn(/3)} < P{\ $ [Ai >6 , A 2 , £ ]} + (3/2 + l/\/2>, 

provided that (3* = min, eJ 4° |/3j| > a^/w°{2/n) \og{d° /e) + 7A2. e - Moreover, 
(1-4) holds and the MC+ estimator (3 achieves the estimation efficiency of 
the oracle LSE (3 , provided that P{\\ e < A < A2, e } — > 1 and e" 1 V min{p — 
d°,pi, ^n/w ^ - 7A 2 , e )/o-} -> oo. 

Corollary 1. iei A univ = cry / (2/n)logp. Suppose \\xj || 2 = n, d*/(c*/ c * + 
1/2) >d°-»oo, 7 > c~ 1 v / 4 + c !t /c* and/3* > o-yV^/n) logd° + 7max{2 x 
\/?ffV(2/n) log pi, A^} in (1.2), (2.11) and (2.1). Then P0(\ univ ) / 3° 
or sgn(3(A univ )) / sgn(/3)} ->■ 0. 

A lower bound condition on /3* can be viewed as an information require- 
ment for selection consistency. A variation below of Proposition 1 in Zhang 
(2007c) asserts that the condition on /3* in Theorem 1 is optimal up to a 
factor of 47\/c*(l + o(l)) when logd° = o(logp). 

PROPOSITION 1. For /3gF let A° and d° be as in (1.3), /3* = mva.j & A° 
and y be as in (1.2) with ||xj|| 2 = n. Let p, d° and a > be dependent on n 
with p — d° — > oo . Then 

liminf inf ^ sup sup P{A / A°} > 1 - 4c 2 Vc> 0, 

n ^°° (X,y)->-l|A |=d /3»=cAi,i 

where A^i = cry 7 (2/n) log(p — d°) and i/ie infimum is taken over all Borel 
mappings. 

Remark 1. Since (2.13) is nonasymptotic, {p, d* ,c*,c* ,d° , (3, a, e} are 
all allowed to depend on n. The requirement d° < d* = d*/(c*/c* + 1/2) 
could be viewed as a condition on the sparsity of (3 given {d* , c*, c*}. On the 
other hand, for given d° = d° it is closely related to the restricted isometry 
constant 5d = max{| HS^uH — 1| : \A\ = d, ||u|| = 1} [Candes and Tao (2005)], 
although c* > 2 is allowed in (2.11). For example, d° < d*/(c*/ c * + 1/2) is 
a consequence of 5%d° < 3/7 with explicit d* = 3d°, c* = 4/7 and c* = 10/7. 
With larger A 2 , e / \/ a 2 logp e and 7, Theorem 5 allows fixed d*/d* > (c*/c* + 
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l)/2, which is a consequence of 52d° < 1/2 or 5^d° < 2/3. See Remark 5 in 
Section 4. 

Remark 2. For p 3> n, random matrix theory provides the possibil- 
ity of d° x n/log(p/n). For example, if the rows of X are i.i.d. Gaussian 
vectors with E~K = and c\ < E'llXbp/n < C2 for all ||b|| = 1, then 
P{(2.11) holds} I with fixed c* = (1 - 5) 2 a, c* = (1 + <5) 2 c 2 and d* = 
max{d: yjd/n(\ + ^2 + 21og(p/d)) < 5}, where < 5 < 1 is fixed [Davidson 
and Szarek (2001), Candes and Tao (2007), Wainwright (2006) and Zhang 
and Huang (2008)]. 

Remark 3. The condition e ~ N(0, <J 2 I n ) is not essential. In particu- 
lar, Corollary 1 holds if the normality assumption is replaced by the sub- 
Gaussian condition Ee*~ e < e CT i H x l I 2 Vx, provided o\ < a 2 . See Section 7.3 
and Lemma 2. 

Theorem 1 compares favorably with existing results in the required reg- 
ularity of X and the information content in the data as measured in /3* = 
min^.^ol/^jl- For the LASSO, a bound similar to (2.13) on selection consis- 
tency essentially requires 

(2.14) & > oV w°(2/n) log(d°/e) + 0*A and A > Ai, e /(1 - 0%)+, 

where 0\ = ||E^ sgn(/3 Ao )||oo and B\ = HE^c^oX^ sgn(/3 j4o )|| 00 [Mein- 
shausen and Buhlmann (2006), Tropp (2006), Zhao and Yu (2006) and 
Wainwright (2006)]. The maxima of 9\ and Q\ over the unknown sgn(/3 j 4o) 
are, respectively, the norms HX^oHoo and HS^jc^oS^ for linear map- 
pings between spaces. Consider the case of d° — > oo. The strong irrep- 
resentable condition, which requires Q\ < 1 uniformly strictly, is restrictive 
since H^^c^oS^o ||oo is not length normalized. For log(p/cf ) x logp — > oo, 
(Ty/(2/n)logp e = (l + o(l))Ai, e by (2.12), so that Theorem 1 replaces 9$/(l- 
0r>)+ in (2.14) by 27\/c* as a required lower bound for the signal-to-noise 
ratio (SNR) /3^/Ai^. For logp = (1 + o(l))logcT, for example, nlogn 
and d* x nj log log n, logpi = o(l) logp, so that Corollary 1 simply requires 
/?* > ay / w°(2/n) log d° + 7 A un i v for (1.4) when c* = 0(1). A commonly used 
bound is 0\ < Vd? /c min (Y, A °)- Wainwright (2006) proved H^Hoc = P {\) 
when the rows of X^o are i.i.d. Gaussian vectors with IK-EE^o) -1 ^ = 0(1). 
The adverse effects of large d° on the LASSO selection are evident in our 
simulation experiments. 

In addition to conditions on X and (3, Theorem 1 makes significant ad- 
vances by allowing the exact universal penalty level A un i v for selection con- 
sistency (Corollary 1) in the case of a known a 2 or A = <7a/ (2/n) logp with 
any consistent upper confidence bound a in the case of unknown a, while 
the penalty level A in (2.14) depends on A° via the £oo norm Q\. 
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From these points of view, the thrust of Theorem 1 is to replace the strong 
irrepresentable condition by the SRC with d° < d* /(c* /c* + 1/2), to replace 
the — > loo norm of matrices of rank d° by the £2 — > £2 norm of matrices 
of rank no greater than d°(c* /c* + 1/2) in the requirement on and to 
completely remove the factor 1/(1 — #2)+ 011 X, compared with (2.14). 

2.4. Estimation of regression coefficients. We have shown the selection 
consistency of the MC+ up to \A°\ < d* = cf/ (c*/c* + 1/2) under (2.11). This 
selection consistency is proved via an upper bound on the false positive in 
Theorem 6 which naturally leads to performance bounds for the estimation 
of f3. Although we do not fully address the topic here, we present a theorem 
to highlight the consequences of our oracle inequalities. 

Let ||b||q = (Ysj=i l^jl 9 ) 1 ^ 9 be the £ q norm with the usual extension to 
q = 00 and Q r< R = {b: ||b|| r < R} be the £ r ball. It was proved recently in 
Ye and Zhang (2009) that for all 1 < r V 1 < q and < e < 1 

(2.15) liminfinf inf ^ sup P{0 - /3||| > (1 - e)R r Xf^} > 4 

subject to ||xj|| 2 = n in (1.2), where the second infimum is taken over all 
Borel mappings of proper dimension and 

A mm = a{(2/n) log(a r p/(n r / 2 R r ))} 1/2 , 

provided that R r /X l mm — > 00 and nA^ m /cr 2 — > 00. This minimax lower bound 
for the £ q loss is an extension of the lower bound for the minimax £ q risk in 
Donoho and Johnstone (1994a). The following theorem provides sufficient 
conditions for the PLUS estimator (2.8) to attain this minimax rate. 

Theorem 2. Let k > and p(t; A) be a penalty satisfying A(l — K|t|/A)+ < 
p(\t\', X) < X. Suppose (2.11) holds with certain d* andc* > c* > k^J 4 + c*/c* . 
Let B be a deterministic subset of {1, ... ,p} with \B\ < d* = d*/(c*/c* + l/2). 
Let 3(A) be as in (2.8) and 3° as in (2.10). Let 6 B = ||X(/3 - Ep°)\\/y/n 
and p e be as in (2.12) with m = d* — \B\ and d° = \B\. 



(i) Let A > 2\/c*((T-y/ (2/n) logp e + 0B/y/rn). Then, with at least proba- 
bility 1 - e/ A/41ogp e; 

f 1 1/2 

(2.16) c, ||3(A) -3°ll < £> 2 (|&|;A) +(A/2)v^B|< (3/2)Xy/\B\. 

(ii) Suppose R r /X r mra = \B\. Let X = 2\/c*{A mm (l + -y/2c7) + e\a/^/n} 
with the A mm in (2.15) and a fixed e± > 0. Let 0<r<lAg<2 and M q = 
(Mf^+Mf^V^ where M 1>q = (c* /c* + l/2) 1 /«- 1 / 2 3(y/*/c*)(l+y/2£+ 
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€2) and Mi A = {(c*/c* + £2/c*)^ 2 + l} 1 ^ with a fixed €2 > 0. Then, with 
{p,R,cr,d* ,c*,c* ,M} all allowed to depend on n, 

(2.17) sup P{||3(A) - f3\\\ > M^RTX^} -» 

as nX^/a 2 ->• 00, where Q TtR = {(3 : Yfj=i \PjY A Kim < RV } 2 ©r,_R- 

Remark 4. We may choose the set B in Theorem 2(i) to minimize 6b 
or given a size but we are not confined to these examples. 

The condition R r /X r mm = \B\ in Theorem 2(ii) is not restrictive, since X and 
a could be scaled by a bounded factor to meet it. Remarks 1, 2 and 3 are 
applicable to Theorem 2 with 7 = 

The oracle inequality (2.16) exhibits the advantage of the MC+ when 
a fraction of \/3j\ are of the order A, since the MCP with 7 = 1/k has the 
smallest possible p(t; X) = (1 — Kt/X)+ under the assumption on the penalty. 

For fixed < c* < c* < 00, (2.17) provides the convergence of /3(A) based 
on (X, y) at the minimax rate (2.15), up to significant f3j} X i? r /A„ im < 
d* = cT/(c*/c* + 1/2), including the case of p 3> n > d° — > 00. Such uniform 
convergence rates in l r balls cannot be obtained from existing results requir- 
ing penalty levels A > A un j v = cry 7 (2/n) logp in the case of A mm /A un i v — > 0. 
Theorem 2 closes this gap by allowing A >c A mm . We observe that A mm < A un i v 
in (2.15) whenever R > a/y/n. The relevance of smaller A mm is evident in 
our simulation experiments where the best penalty levels for estimation are 
all less than or equal to A un i v . See Section 6.1 in addition to Table 1. For 
recent advances in the LASSO or LASSO-like estimations of X/3 and (3, we 
refer to Greenshtein and Ritov (2004), Candes and Tao (2007), Bunea, Tsy- 
bakov and Wegkamp (2007), van de Geer (2008), Zhang and Huang (2008) 
and Meinshausen and Yu (2009). 

2.5. Organization of the rest of the paper. Section 3 provides an explicit 
description of the PLUS algorithm and studies the geometry of the solutions 
of the estimating equation (2.6). Section 4 studies the selection consistency 
of both the global minimizer of (1.1) and the local solution (2.8) for gen- 
eral penalties. Section 5 develops methods for the estimations of the mean 
squared error (MSE) of the penalized LSE and the noise level in the linear 
model (1.2). Section 6 reports simulation results. Section 7 contains some 
discussion. 

3. The PLUS algorithm and quadratic spline penalties. We divide this 
section into three subsections to cover quadratic spline penalties, the PLUS 
algorithm and the existence and uniqueness of the MC+ path. An R package 
"plus" has been released. 
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Fig. 1. The £i penalty pi(i) = t for the LASSO along with the MCP p 2 (t) and the SCAD 
penalty ps(t), t>0, j = 5/2. Left: penalties p m (i). Right: their derivatives p m {t). 

3.1. Quadratic spline penalties and the MCP. The PLUS algorithm as- 
sumes that the penalty function is of the form p(t; A) = X 2 p(t/X), where p(t) 
is a nondecreasing quadratic spline in [0, oo). Such p(t) must have a piecewise 
linear nonnegative continuous derivative p(t) for t > 0, so that the solution 
graph of (2.6) is piecewise linear. The maximum concavity n(p) = n(p; A) 
does not depend on A. We index p(t) by the number of threshold levels m, 
or equivalently the number of knots in [0, oo), including zero as a knot. Thus, 



with u\ = 1, v m = 0, t m+ \ = oo and knots t\ = < ti < ■ ■ ■ < t m = 7, satisfy- 
ing m - Vit i+ i = m+i - v i+ it i+ i > 0, 1 < i < m. 

We set p m (0+) = «i = 1 to match the standardization /j(0+;A) = A in 
(2.3), and v m = for the uniform boundedness of p(t; A). The unbiasedness 
feature lim^oo p(t\ A) = demands t m = 7 > = u m = v m and thus m > 1, 
but the PLUS includes the LASSO with m= 1. For ||xj|| 2 = n, c m ; n (SA) < 1, 
so that (2.5) becomes K(p m ) =maxj< m Uj < c* < 1 under (2.11). 

The penalty class (3.1) includes the l\ penalty with m = 1 and n(pi) = 0, 
the MCP with m = 2 and k(p2) = «i = 1/7, and the SCAD penalty with 
m = 3, vi = 0, ti = 1 and ^(^3) = f2 = 1/(7 — 1). We plot these three penalty 
functions p m , m = 1, 2, 3 and their derivatives in Figure 1, with 7 = 5/2 for 
the MCP and SCAD penalty. 

As mentioned in the Introduction, we propose the MCP (2.1) as the de- 
fault penalty for the PLUS, and thus the acronym MC+. The MCP corre- 
sponds to (3.1) with 



(3.2) p 2 (t)=min{t-t 2 /(27),7/2}, fa(t) = (1 - t/j)+, t>0. 



p{t; A) = X 2 p m (t/X), p m (t) = (dp m /dt)(t) 



(3.1) 



■in 




i=i 
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Among spline penalties satisfying (2.3), the MCP has the smallest number 
of threshold levels m = 2. It follows from (2.6) and (3.1) that the piecewise- 
linear PLUS path makes a turn whenever \f3j{\)/\\ hits one of the m thresh- 
olds for any j < p. From this point of view, MC+ is the simplest for the PLUS 
to compute except for the LASSO with m = 1. 

3.2. Explicit description of the PLUS algorithm. Let z = X'y/n. For 
penalty functions of the form p(t;X) = X 2 p m (t/X) with the p m in (3.1), the 
estimating equation (2.6) is equivalent to the following rescaled version: 



(3.3) 



Zj-7djb = 8gn(b : j)p m (\bj\), bj^O, 
\z l j -x' J h\<l = p m (0+), bj = 0, 

through the scale change z/A — > z and (3/X — > b, where Xj = X'xj/n are the 
columns of £ = X'X/n. The solution b(z) of (3.3) along the ray {z/A, A > 
0} provides the solution of (2.6) with the inverse transformation /3(A) = 
Ab(z/A). 

We shall "plot" the solution b(z) of (3.3) against z to allow multiple 
solutions, instead of directly solving it for a given z = z/A = X'y/(nA). In 
the univariate case p = 1 , we plot functions in M 2 . For p > 1 , we need to 
consider b versus z in R 2p . Let H = W, H* be its dual, and z © b be 
members of H © H* = R 2p . Define 

ti, < i < m + 1, 

-t\i\ +1 , -m<i<0, 



(3.4) u(i) = u\i\, v(i)=v\i\, t(i) 



where Ui,Vi and U specify p m as in (3.1). For indicators r] £ {— m, . . . ,m} p , 
let 



S(rj) = the set of all z 

(3.5) 



satisfying 



Zj ~ Xjb = sgn(r?j)n(77j) - bjv{rjj), rjj / 0, 

-l<^--x;-b<l, 7?, =0, 
%)<^<% + l), 

6j = 0, 7]j = 0. 



Since sga(bj)p m (\bj\) = sgn(r]j)u(r]j)-bjv(rij) for t(rjj) < bj < t(rjj + l), (3.3) 
holds iff (3.5) holds for a certain rj. For each r], the linear system in (3.5) 
is of rank 2p, since one can always uniquely solve for b and then z if the 
inequalities are replaced by equations. Thus, since (3.5) has p equations and 
p pairs of parallel inequalities, S(rj) are p-dimensional parallelepipeds living 
in H@H* = R 2p . Due to the continuity of p m {t) = (d / dt) p m (t) in t by (3.1) 
and that of zj — x'^° m both z and b, the solutions of (3.5) are identical in the 
intersection of any given pair of S(rj) with adjacent 77. Furthermore, the p- 
dimensional interiors of different S(rj) are disjoint in view of the constraints 



MINIMAX CONCAVE PENALTY 



15 



of (3.5) on b. Thus, the union of all the p-parallelepipeds S(rj) forms a 
continuous p-dimensional surface S = L){S(rj) : ij G {— m, . . . , m} p } in H ® 
H* = M. 2p . This continuous surface S is the solution set (or the "plot") of all 
z © b G H © H* satisfying the rescaled estimating equation (3.3). 

Given z = X'y/n, the solution set of (3.3) for all z = r z and r > 0, or 
equivalently that of (2.6) for all A, is identical to the intersection of the 
surface S and the (p + 1) -dimensional open half subspace {(rz) © b:r > 
0,b G H*} in R 2p . Figure 2 depicts the MC+ and LASSO solution sets 
and the projections of S(rj) to H in the nonoverlapping scenario [under the 
convexity condition (2.5) with full rank d* = p = 2]. Figure 3 depicts an 
overlapping scenario in which the complete solution set of (2.6) contains the 
main branch covered by the MC+ path and a loop not covered. 

The rescaled PLUS path in H © H* is a union of connected line segments 

k* 

(3.6) \Je(rj^\z), ^(r 7 |z)=5(r ? )n{(rz)ffib:r>0,bGif*}, 
fe=o 

beginning with £(rj^\z) = {(rz) ©0:0 < t < t^ )}, r}(°> = and connected 
at 

(3.7) {{^ k -^z)®b^} = e(r 1 ^\z)r\£{r 1 ^\z), z = X'y/n. 

Given (r^z) © b( fc_1 ) , we find a new; line segment £(r]^\z) and compute 
the other end of it as (t^ ' z)ffib( fe ), fc>l. Given z, we write the turning 
points in the simpler form ffib( fc ) G R 1+p . The PLUS path (2.7) is defined 
through the linear interpolation of © and reverse scale change from 




Fig. 2. Left: the solid ray as rz and the projections of the 5 2 = 25 parallelograms S(rj) for 
the MCP to the z-space H with dashed-edges, labeled by r/i and 772 along the margins inside 
the box. Right: the MC+ path (solid) as the entire solution set of (2.6) in the [3-space, along 
with the LASSO path (dashed). Data: ||xj|| 2 /2 = 1, xix 2 /2 = 1/4, (zx,z 2 ) = (1,-0.883) 
and p = 7 = 2. 
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Fig. 3. Plots for the same data as in Figure 2 with 7 = 1/2 for the MCP. Clockwise from 
the top left: the z-space plot with overlapping areas marked by multiple values of rjj ; the 
main branch and one loop as the entire MCP solution set of (2.6) in the /3-space, along with 
the LASSO; the segments of the main branch with r^ k 'z, fc = 0,l,2,3, representing tran- 
sitions rj = (p) — > (J) — > (p) — > (j^) — > (_? 2 )/ the loop with r^'z, k = 4,5,6,7 , representing 
transitions tj — (_° 2 ) — > — > (j^) — > (j^) — > (_° 2 ). For tj £ { — 2,0,2} p , z-segments turn 
into f3-pomts in the MC+ path. A topologically equivalent way of creating the main branch 
and loop is to fold a piece of paper twice parallel to the horizontal axis and then twice 
parallel to the vertical axis, cut through the fold and then unfold. 



f T (x) 



(3.8) 



b(«) 

;(*) 



b to Affi/3: 

; - x)(t^-^ © b^" 1 )) + (x-k + l)(rW © b^), 
k — 1 < x < k, 

t \^ © 3 W = (1 © bW)/r( x \ 0<x<k*, 

with the initialization rj^ = b^ = and = l/maxj< p \zj\. The PLUS 

path ends at step k* if (3 provides a global least squares fit with X' (y — 

X3 (fc * } ) = 0. We define fi [x) = p {k * ] and A^ = (k*/x)\( k *1 for x > k* . Clearly, 
x interpolates the number of steps in [0, k*]. 

We compute the turning points © in (3.8) by finding the "state" 
rj( k \ the slope s( fe ) = (db^/dr^ ) , k — 1 < x < k, the sign ^ = sgn(r( fc) - 
r^'" 1 )) and the "length" A( fe ) = - r^ -1 ^ for the new segment. We 
now provide algebraic formulas for the computation of these quantities in a 
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certain "one-at-a-time" scenario. We prove that the PLUS path is one-at-a- 
time almost everywhere in (X, y) in the next subsection. 

At r^ 1 ) ffib^ 1 ), (3.8) must hit one of the inequalities in (3.5) for a 
certain index 

^-i) G{i: |^-D| G{tl5 ... )U 

(3.9) 

with r]f~ 1] ^ 0, or |t ( * _1) ?j - x'j^'^l = 1}, 

where t\,...,t m are the knots of (3.1). If is unique, r)j = r]j k ^ for 

j 7^ j'^'" 1 ) and 

im (*> - / BguM*- 1 ^- -Xi-b^), ^ =0, 
L '+sgn(&] ' -b) '), rf- / 0, 

for j = j( fe_1 ). Let be as in (2.4) and A(rj) = {j :rjj 7^ 0}. Define 

£(77) = £ X(fj)> Q(r7) = E(tj) - diag^(r ?i ), / 0), 

(3.11) 

d(r7) = |A(r?)|. 

Since the Xj in (3.3) are the columns of S, for k — 1 < x < k the first 
equation of (3.5) can be written as Q^^P^^b^) = P(r)^)(r^z - 
sgn(?7^ c ^)ii(T7^ c ))), where P(/?) is the projection b — > (bj,r]j 7^ 0)' and u(-) is 
as in (3.4). Differentiating this identity, we find 

(3.12) Q(r 7 ( fc ))P(r ? W)s( fc )=P(r 7 W)i, rjf ) = ^ # )=0 ' 

so that s^ fc ^ is solved by inverting Q(rj^). If the segment £(77^) |z) does not 
live in the boundary of S(rj^), the path has to move into its interior from 
side j^ -1 **, so that 



(313) ^ )= uf-v?- i) ) S Ms?\ iyo,j=j^ 

\ r/f _1) sgn(V.s( fc ) - zj), rjf ] = 0, j = jC* -1 ) 



Given the slope s( fc ) and the sign of (ir for the segment, there are at 
most p possible ways for (rz) © b(rz) to hit a new side of the boundary of 
the p-parallelepiped 5(77^) in (3.5). If it first hits the boundary indexed by 
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,(*) 



by (3.5) and (3.8) A^) would be 



(3.14) 



A 



(k) 



ef {t(^+i)-ij-^}/ fl } 



/„(*) 



t(*0 ( fc ) ^ n ^ ( fc ) 



(fc). 



(fc-i) 



e W {1 _^-; }/{? ._ x / sW}; 

(fe) 



tCO (fe) ^ n / (*0 
,(*-!) 



d fc) (%-x's( fc ))>0 



tf{-i-gr L, y&-x!i* w }> 



^ fc) (%-x;s( fc ))<o=7 ? 



where t(-) is as in (3.4) and ^ 1} = r^ -1 ^- - x^b^" 1 ). It follows that 



(3.15) T W =T (k-i) + ^(k) A (k) A^= min A, 

i<i<P 3 



(fc) 



with the minimum attained at jf 
PLUS as follows. 



as in (3.9). We formally write the 



The PLUS Algorithm. 

Initialization: r/ ) <- 0, b^ <- 0, r (0) i- 1/ 

Iteration: 

(3.16) Find t/^ by (3.9) and (3.10), 

(3.17) Find s (fe) by (3.12), 

(3.18) Find r (fc) by (3.13), (3.14) and (3.15), 

(3.19) b« <- b^ 1 ) + (r« - r( fc " 1 ))s( fc ), 
fc-<- fc + 1. 

Termination: (3.16) has no solution for k = k* + 1 or r( fc *) = oo. 
Output: r(°) , b(°) , ?/(*) , sW , r W , bW , jfe = 1, 2, . . . , k* . 



3.3. T/ie existence and uniqueness of the PLUS path. We prove in this 
subsection that for the MCP the PLUS algorithm computes the main branch 
(2.7) of the solution graph of (2.6) and that the main branch is unique almost 
everywhere in (X, y). 

Nondegenerate designs. The design matrix X in (1.2) is nondegenerate 
if for all Ac {1, ... ,p} of size \A\ = n Ap — 1 and rjj E {—1,0, 1}, j < p, the 
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n Ap vectors 

(3.20) < Xj,j G A, r/fcXfc > are linearly independent. 

^ k^A ' 

Forp<n, X is nondegenerate iff rank(X) =p. 

Theorem 3. Suppose the MCP is used in the PLUS algorithm. Let 
Q(r/ fc )) be as in (3.12). 

(i) Suppose the design matrixX. is nondegenerate in the sense of (3.20). 
Given X, there exists a finite set To(X) such that for all 7 ^ To(X), a path 
of the form (3.8) exists with det(Q(r]^)) 7^ for k < k* and perfect fit 

X'(y — X/3^ ) = at a finite final step k* . 

(ii) For fixed 7 > 0, the design matrix X is nondegenerate and 7 ^ To(X) 
almost everywhere in W lXp under the Lebesgue measure. 

(iii) For fixed positive 7^1, the design matrix X is nondegenerate and 
7 ^ To(X) almost everywhere under the product of p Haar measures in the 
(n — l)-sphere {x : ||x|| 2 = n}. 

(iv) Suppose 7 ^ To(X). Then, almost everywhere in z = X'y/n £ W, 
the graph of (2.7) is unique and the PLUS algorithm computes (2.7) within 

a finite step k* and ends with an optimal fit satisfying X'(y — X/3 ) = 0. 
Consequently, for all <k <k* the path (3.8) is one-at-a-time in the sense 
of (a) the uniqueness and validity of (3.9), (3.10), (3.12), (3.13) and (3.15) 
and (b) the positiveness of and in (3.15). 

(v) If Q(rj^) is positive- definite and £(rj^\z) in (3.6) does not live in 

the boundary of S(r} k >) in (3.5), then (3 is a local minimizer of L(b;A) in 
(1.1) at \ = \ { - x \ k-l<x<k. 

Theorem 3(ii) and (iii) ensure that 7 ^ To(X) almost everywhere in X for 
all fixed {n,p, 7}. The condition of 7 ^ To(X) is not necessary for the MC+ 
path to end with an optimal fit. For example, if Xj = ±Xfc , the PLUS path 
uses at most one design vector Xj or x^,, in any step, so that it behaves as if 
one of them never exists. Theorem 3(iv) guarantees that the PLUS algorithm 
yields an entire path of solutions (2.7) covering all < A < 00. Theorem 3(v) 
implies that the estimator /3(A) is a local minimizer under (2.5) whenever 
#{j : /3j(A) 7^ 0} < d* , as guaranteed by the conditions of Theorems 1, 2, 5 
and 6. For simplicity, we omit an extension of Theorem 3 to the PLUS with 
general quadratic penalty (3.1). 

We note that the map — > (3 ^ is potentially many-to-one in the PLUS 
path due to the possible concavity of the penalized loss, since < r^ _1 ) 
is allowed as (3.8) traverses through the solution graph. Theorem 3 does not 




Fig. 4. The same type of plots as in Figures 2 and 3 for the same X and more sparse 
(zi,Z2) = (1,-1/2). From the left: the z-space plot for MC+ with 7 = 2; MC+ with 
7 = 1/2; the MC+ (same for both 7 = 2 and 7 = 1/2) and LASSO paths in the (3-space. 
The loop disappears since the solid line rz does not pass through the places where the 
projection of S folds in two different directions. 



guarantee that the PLUS path contains all solutions of (2.6) due to loops 
outside its path, as Figure 3 demonstrates. However, such multiplicity of 
branches is less severe for sparse data. In the example in Figure 4, the convex 
penalized loss with 7 = 2 yields identical MC+ path as the nonconvex one 
with 7 = 1/2 for sparse data outside regions where the the projections of the 
parallelograms S(rf) fold severely in the z-space for 7 = 1/2. This should be 
compared with the dramatic difference between 7 = 2 and 7 = 1/2 in Figures 
2 and 3 for dense data. 

4. Selection consistency for general penalty. We provide in Section 4.1 
two sets of lower bounds for the probability of correct selection for general 
penalized LSE: one for the global minimizer of (1.1) in the regular case 
of rank(X) = p {p <n necessarily) and one for the local solution (2.8) in 
the case of rank(X) <p (including p^>n). These lower bounds imply the 
sign consistency P{sgn(/3) = sgn(/3)} — > 1 and thus the selection consistency 
(1.4) as max(ra, p) —> 00. As a crucial element of our proof and a matter of 
independent interest, we also provide in Section 4.2 upper bounds of the 
false positive for any given oracular set B of interest and a general class of 
penalties. 

4.1. Probability bounds for selection consistency. Our selection consis- 
tency results are proved by showing that the global minimizer of (1.1) or 
the local solution (2.8) are identical to the oracle LSE (2.10) with high 
probability. Let 

(4.1) (wj,j G A )' = the diagonal elements of S^„, 

so that Var(/3°) = w°a 2 /nVj G A° for the oracle LSE 3° with B = A°. We 
first present nonasymptotic bounds for selection consistency under the fol- 
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lowing global convexity condition: 

(4.2) c min {V) + {p{t 2 ;X)-p(t 1 ;X)}/(t 2 -t 1 )>0 VO < t x < t 2 , 

where 5] = X'X/n. Under (4.2), (2.6) is the KKT condition and its solution 
is unique, so that the estimator (2.8) is the global minimizer of (1.1). Let 
$(•) be the N(0, 1) distribution. 

Theorem 4. Suppose (2.3) and (4.2) hold for Ai < A < A 2 . iet 3(A) 
6e as in (2.8) for each A > and (3 = /3(A) for a deterministic or random 
penalty level A. Let A° , d°, A and /3* = ming^ |/3j| be as in (1.3), (1.4) o- n d 
(2.9) and 3° be as in (2.10) with B = A° . Suppose f3* > 7A2 and P{A X < 
A< A 2 } = 1. Then 

(4.3) P{A^A°}<P0^P° or sgn(3)/sgn(/3)}<x rail (Ai) + ^, 2 (A 2 ), 

where 7T n; i(A) = 2 J2j£A° ^ > (- n V( cr ll x j ID) and-K n ^{\) = Y,jeA° ®((lX- 
(a^/n) 1 /*)). 

Corollary 2. Suppose (2.3), (4-2), ||xj|| 2 = n and |/3,| > 7A + cr x 
yjw°(2/n) log a n /or a// j G ^4° mtt a n > d° and A > Ai^ = er x 
-y/ (2/ra) log(p — d°). Then, for large y/nX/a and a n , 

(4.4) P{3(A) / 3° or sgn(3(A)) + sgn(/3)} -> 0. 

For the MC+, (4.2) is equivalent to c m i n (S) > I/7, and /3* > ( , y + Vw°)X- a mv 
with p — > 00 suffices for (4.4), where n>° = maxjg^o w° is as in Theorem 1. 
For the SCAD, we need the larger 7 > 1 + l/c min (5]) for (4.2). For d° <p 
and ||xj|| 2 = n, (4.4) provides theoretical support to the heuristic condition 
(2.9) for the selection consistency at A = A un i v . 

We now consider selection consistency for general p, including p> n. For 
c* > c* > k > and < a < 1, define w = u> c »,c*,K,a = (2 — a)/(c*c* / k 2 — 1) 
and 



(4.5) 

. (1 + w{l + (a/t)/(l - q)})cVc, - 1 

o<i<(2/i«+i+o)/a {2 + u;(l + a — ia)}(l — a) 



Theorem 5. Let p(t;X) be a penalty satisfying p(0+;X) = X, p(t;X) < 
XI {t < 7A} and p(t; A) > -k for allt>0 and X > Ai . Lei A°, d°, 3 = 3(A), 
A, (3 , Wj, 7r nj i(A) and 7r„ )2 (A) be as in Theorem 4- Suppose (2.11) holds 
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Table 2 

Example configurations of {c* ,c*,K, a} for fixed K, c* = 1 + 5, c* = 1 — S, optimal 
t = y/(c~/c.)/K./(l-a) m (4.5) 





K, = 


1/2 




K * — 


1 




K * — 


2 




K * — 


3 


s 


a 


1/k> 


S 


a 


1/k> 


S 


a 


1/k> 


8 


a 


1/k> 


1/4 


1/5 


4.84 


2/5 


1/5 


4.14 


1/2 


1/3 


3.30 


1/2 


1/2 


2.98 


1/5 


2/7 


3.73 


1/3 


1/3 


3.57 


1/3 


1/2 


2.32 


1/3 


1/2 


1.73 


1/6 


1/3 


3.28 


1/4 


2/5 


2.65 


1/4 


1/2 


1.86 


1/4 


1/2 


1.49 



with certain rank d* and c* > c* > «. For i/iese {c*,c*,k} and < a < 1, 
/et A* be as in (4-5). Suppose (1.2) holds with d° < d* = d*/(l + A'*). Let 
vrn, 3 (A) = { p -f)P{or 2 Xm > mX} with m = d*-d°. 



(i) Let A2 > max{Ai, (vc*/a)A3}. Suppose /3* > 7A2 and P{Ai < A < 
A 2 } = 1. T/ien 

3 

(4.6) P{l^°}<P{3/3° or sgn(3)/sgn(/3)}<^vr nifc (A fc ). 

fc=i 



(ii) Let Xi >e = o~\J (2/n)log((p — d°)/e), X^^ = 0\J (2/n) logp e ™^ Pe mi 
(2.12), A2 j£ > max{Ai j£ , (\/c*/a)A3 i(E } and a n > d°. Suppose |/3j| > 7A2, e + 
aJw?(2/n)log(an/e) for j G A° and || Xi || 2 = n. // P{A lj£ < A < A 2 ,J = 1, 



then 

P{A / A°} < P0 / 3° or sgn(3) / sgn(/3)} 

f471 < f 1 d /(2a ra ) (41ogp 6 )-V 2 

1 ' J -\lVJi lVJ 2 IVJ3 



with J\ = y/-K\og({p — d°)/e), J3 = {21ogp e — 1 + l/m}^mir/ y / 41ogp e and 
J2 = log(a n /e). Consequently, (1-4) holds as e _1 V min(Ji, J2, J3) — > 00 
and P{Ai<A<A 2 }^l. 

Remark 5. A convenient choice is a = 1/2 and t = 3 in (4.5) which 
leads to A» < {1 + 2/(c*c*/k 2 - l)}c*/c* - 1. In Theorems 1 and 2, 1/k = 

7 > c^V 4 + c */ c *' so that K * ^ c */ c * - V 2 - For tne LASSO, k = = w 
and A* = (c*/c* — l)/(2 — 2a). Some other configurations of {c*,c*,K,a} 
are given in Table 2. 
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Remark 6. Theorem 5(i) is applicable to the problem of finding a sparse 
solution /3 of y = X/3 with p> n, i.e., e = in (1.2). With A2 = \^ k *^ (nearly 

zero) and cr = Ai = A3 = a = 0, it asserts (3 = (3 at the last step of the 
PLUS algorithm whenever /?* > 7A( fc *) and d° < d* /{K* + 1), where + 1 = 
(c*/c* + l)/{2 - k 2 /(c*c*)}. See Section 6.5. 

Remark 7. Consider the MC+ and LASSO. For /3* > 7A univ , the oracle 
r(z © /3 ) has a high probability of solving (3.5) for the parallelepiped S(rj) 
with 77 = 2sgn(/3). Such parallelepipeds are unbiased, since they involve re- 
gions with u(±2) = v (±2) = = P2(|&j|) i n (3.5). An extension of Theorem 5 
to biased S{rj) requires sgn(/3j)(l + I{\{3j \ > 7A}) = rjj with a larger A. Such 
an extension with maxj|/3j| < 7A and rj = sgn(/3) would match the theory of 
selection consistency for the LASSO with uniformity in a neighborhood of 
7 = 00. 

Compared with Theorem 4, an obvious advantage of Theorem 5 is its 
applicability to p > n. In the case of p < n, Theorem 5 still allows c* > 
c m i n (5]) and thus smaller 7= 1/k and /?* for the MC+ than Theorem 4 
does. With k = I/7, the MCP allows the smallest 7 and thus the smallest 
possible /3* in Theorem 5. 

4.2. An upper bound for the false positive. Given a target set B C {1, . . . ,p}, 
we provide upper bounds for the false positive #{j ^ B : |/3j(A)| > 0} for the 
selector (2.8) with a general class of penalties. See Remark 4 for examples 
of B. 

Theorem 6. Suppose (2.11) holds with certain d* and c* > c* > k > 0. 
For these jc*, c*, ft} and an a £ (0,1), let be as in (4-5). Let B be a 
deterministic subset of {1, ... ,p} with \B\ = d° < = d* /(K* + 1). Let \± > 
0. Suppose p(t;X) satisfy A(l — Kt/X) + < p(t;X) < A for t > and A > Ai. 
Let A > Ai V {{\fd* /a){ayj (2/n)logp e + Ob/Vtu)} with the 6b in Theorem 
2, m = d* -d° and p e in (2.12). Let ft = 3(A) with the 3(A) in (2.8). Then 

P{#<jtB:%j>G)>\V{K*\B\)} 

(4.8) 

< e{\ogp e )- l/2 e^ /2 ^{-p) < e/y/2, 

where \x = {21ogp e — 1 + l/m}y/rn/ a/2 log p e and $>(x) is the N(0, 1) distri- 
bution function. 

This theorem is an extension of the upper bound on \A\ in Zhang and 
Huang (2008) from the LASSO to a general continues path of penalized 
LSE. Since it is relatively easy to find sharp conditions for the oracle LSE 
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(2.10) to be a solution of (2.6), the upper bounds in Theorem 6 is a crucial 
element in our proof of selection consistency. Remark 5 applies to Theorem 
6. 

5. The MSE, degrees of freedom and noise level. In this section, we con- 
sider the estimation of the estimation and prediction risks for general penal- 
ized LSE and the noise level in (1.2). Formulas for the degrees of freedom and 
unbiased risk estimators are derived and justified via Stein's (1981) unbiased 
risk estimation (SURE). Necessary and sufficient conditions are provided for 
the continuity of the penalized LSE. 

5.1. The estimation of MSE and degrees of freedom. The formulas de- 
rived here are based on Stein's (1981) theorem for the unbiased estimation 
of the MSE of almost differentiable estimators of a mean vector. A map 
h : R p — > W is almost differentiable if 

(5.1) h(z + v) =h(z) + I J H(z + xv)dxjv Vv e R p , 

for a certain map H : W — > MP xp . Suppose in this subsection that p(t;X) is 
almost twice differentiable in t > 0, or equivalently 

(5.2) p(t; A) = ^p(t; A) = p(l; A) + J p(x; A) dx V< > 0, 

for a certain function p(x; A). Under this condition, p(t; A) = (d/dt)p(t; A) al- 
most everywhere in (0, oo) and the maximum concavity (2.2) can be written 
as k( P ;X) = [|Cp(*5 [|oo- 

For multivariate normal vectors z ~ 7V(/i,V), Stein's theorem can be 
stated as 

(5.3) £h(z)(z-/i)' = £H(z)V, 

provided (5.1) and the integrability of all the elements of H(z). This applies 
to the penalized LSE. Let be as in (2.4). We extend (3.11) to general 
penalties p(t; A) as follows: 

Q(/3; A) = £ {J: ^o} + diag(p(|^|; A), ft ^ 0), 

(5.4) 

d(/3) = #{j:/3^0}. 



Theorem 7. Let A > be fixed and (3 = /3(A) = argminbL(b; A) with the 
data (X,y) in (1.2) and L(b;A) in (1.1). Suppose (2.5) holds with d* =p. 
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Let SI = X'X/n and P be the d{(3) x j> matrix giving the projection Pb = 
(bj : / 0)' as in (3.12). Then 

E0-(3)0-(3)' 

(5 " 5) 2,2 

= E\0- (3)0 - (3)' + H£_p' Q -i(3; A )P 1 - ^S- 1 , 
[ n J n 

where (3 = 5] _1 X'y/n is i/ie ordinary LSE of f3. In particular, for all a £ K p , 

(5.6) |a'(3 - 3)| 2 + — (Pa) / Q- 1 (3; A)(Pa) - — a'E^a 

n n 

is an unbiased estimator of the MSE E\sl{(3 — (3)\ 2 , provided a 2 = a 2 in 
the case of known a 2 or a 2 = ||y — X/3|| 2 /(n — p) in the case of p < n. 
Consequently, 



(5.7) 



e{ ||3 -f3\\ 2 + — trace(Q~ 1 (3; A)) - — trace^" 1 ) ) = E0 - (3\\ 
I n n J 



2 



Remark 8. Condition (2.5) with d* = p asserts c m i n (X!) > n(p; A), which 
is slightly stronger than the global convexity condition (4.2). We prove in 
the next subsection that (4.2) is a necessary and sufficient condition for 
the continuity of (3, which is weaker than the almost differentiability of (3. 
Thus, the conditions of Theorem 7 are nearly sharp for the application of 
the SURE. In the fcth segment of the PLUS path, Q (3(A); A) = Q(v {k) ) as 
in (3.11). 

Let [i = Ey = X/3 and /i = X/3 with the penalized LSE in Theorem 7. Let 
Jl and p, be the orthogonal projections of y to the linear spans of {xj, j < p} 
and {xj , /3j ^ 0} , respectively. For uncorrelated errors with common variance 
a 2 , the degrees of freedom for fi,° is Y^j=\ Cov(/tj, Jl°)/a 2 = rank(xj : /3j ^ 0). 
Thus, since E\\J1 — fi\\ 2 = a 2 rank(X) and \\fi- — n\\ 2 + \\J1 — /x|| 2 — \\J1 — fi<\\ 2 = 
2(M - /!)'(£-/*), 

(5.8) df(P)^t52%M = i E ( raA( x)-J^ + Mziil!) 

(J \ (J (J ) 

J=l 

extends the notion of degrees of freedom. This also provides the C p -type risk 
estimate 

(5.9) C p = dp(X) = ||/x - fif + d 2 {2df - rank(X)} » ||/2 - /z|| 2 . 
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Theorem 7 suggests the unbiased estimator for the degrees of freedom (5.8) 
as 

(5.10) off = df (A) = trace(Q- 1 (3; A)P£P') 

and the related C p -type estimator of the MSE E\\fl — fJ-\\ 2 via (5.9). We 
refer to Efron (1986) and Meyer and Woodroofe (2000) for more discussions 
about (5.8) and (5.9). We will present in Section 6 simulation results to 
demonstrate that (5.9) provides a reasonable risk estimator. The following 
theorem asserts the unbiasedness of (5.8) and (5.9). 

Theorem 8. Suppose (2.5) holds with d* =p. Then, the SURE method 
provides unbiased estimators for the degrees of freedom and the £2 r isk for 
the estimation of the mean vector, 

(5.11) E(df) = df (fi), EC P = E\\fl - fi\\ 2 , 

in the linear model (1.2), where df(/i), df and C p are, respectively, given 
by (5.8), (5.10) and (5.9), the a 2 in (5.9) is as in (5.6), and £t = X/3 is as 
in Theorem 7 with a fixed A. Furthermore, if p(t; A) = Xt for the LASSO or 
> 7A for all /3j / under (2.3), then 

(5.12) df = #{j :^0}. 

Under a positive cone condition on X, Efron et al. (2004) proved the 
unbiasedness of #{j : j3j 7^ 0} as an estimator for the degrees of freedom for 
the LARS estimator (not the LASSO) at a fixed step k. Our definition of 
the degrees of freedom and C p is slightly different, since we use \\J1 — fl\\ 2 
and rank(X) in (5.8) and (5.9) for variance reduction, instead of ||y — p,\\ 2 
and n. We prove E#{j ■ Pj + 0} = df (fi) for the LASSO for fixed A without 
requiring the positive cone condition, but not for fixed k with a stochastic 
A. The performances of C p for the LASSO and MC+ are similar in our 
simulation experiments. 

5.2. Estimation of noise level. Consider throughout this subsection stan- 
dardized designs with ||xj|| 2 =n for all j <p in (1.2). We have shown in 
Theorem 1 and Table 1 that the MC+ at A un i v = cry (2/ra) logp works well 
for variable selection. In practice, this requires a reasonable estimate of the 
noise level a. For p < n, the mean residual squares ||y — /I|| 2 /{n — rank(X)} 
for the full model provides an unbiased estimator of a 2 as in Table 1, where 
Jl is the orthogonal projection of y to the linear span of < p}. However, 
the estimation of a 2 is a more delicate problem for p > n or small n — p> 0. 
Here, we present a simple estimator of a 2 in such cases based on Theorem 
8. 
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Since (2.8) provides estimates /x(A) = X/3(A) of the mean fi = X/3, we 
may use 

(5.13) a 2 (A)^||y-/i(A)|| 2 /{n-df(A)} 

to estimate a 2 , with the df(A) in (5.10) as an adjustment for the degrees 
of freedom. Still, good a 2 (A) requires a consistent ji(A), which depends on 
the choice of a suitable A of the order o\J (logp) jn. This circular estimation 
problem can be solved with 

(5.14) a = a(\), A = min{A > A* : ct 2 (A) < nA 2 /(r logp)}, 

for suitable r$ < 2 and^A* > 0. Here, A* could be preassigned or determined 
by upper bounds on df(A) or the dimension #{j:/?j(A) 7^0}. In principle, 
we may also use in (5.14) estimates <r 2 (A) based on cross-validation or boot- 
strap, but the computationally much simpler (5.13) turns out to have the 
best overall performance in our simulation experiments. 

5.3. Convexity, continuity and almost differentiability. Here, we consider 
the continuity and almost differentiability of a penalized LSE (3, which the 
proof of Theorems 7 and 8 require. 

The continuity of (3, demanded by Stein (1981), is a property of indepen- 
dent interest on its own right for robust estimation [Fan and Li (2001)]. For 
full rank designs, we provide here the equivalence of the continuity of the 
penalized LSE and the global convexity condition (4.2). We have considered 
(2.3) for unbiased selection. For the continuity of (3, we only need 

(5.15) lim p(t;X)/t 2 = 0, < p(0+; A) < oo. 

t— »oo 



Theorem 9. Let A be fixed. Suppose p(t; A) is continuously differ entiable 
int>0, (5.15) holds, and rank(X) =p. Then the following three statements 
are equivalent to each other: 

(i) The global minimizer (3 of (1.1) is unique and continuous in y G M. n . 

(ii) The global convexity condition (4-2) holds. 

(iii) The penalized loss L(b; A) in (1.1) is strictly convex in b G MP. 

For p> n, an implication of Theorem 9 is the continuity of solution f3 of 
the estimating equation (2.6) subject to {j : f3j / 0} C A for all fixed A and 
A with \A\ <d*, provided the sparse convexity (2.5). Thus, minimizing the 
maximum concavity allows the broadest extent for such sparse continuity 
of solutions of (2.6). The most difficult part of the proof of Theorem 9 is 
(i) =>■ (ii), which is done by showing (x,x, . . . , x)' = xl is in the range of 
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(3 for all x > 0. Since the penalized loss attains minimum at (3, Q(/3; A) in 
(5.4) is positive definite for smooth penalties, and the positive-definiteness 
of Q(il; A) gives c min (£) > p(t; A). 

The application of SURE in Theorems 7 and 8 also requires the almost 
differentiability of (3. In the following proposition, we establish the stronger 
Liptchitz condition for (3 under the conditions of Theorem 7. 

Proposition 2. Let A and X be fixed and treat (3 in (2.8) as a function 
of y. Suppose (2.5) holds with d* = p. Then (3 = h(z) for z = X'y/n S MP 
and a certain almost differentiate function \i:MP — >MP, such that for all z 
and v in MP 



where Q is as in (5.4) and P(/3; A) :b — > (bj :(3j ^ 0)' is as in (3.12). Con- 
sequently, h(z) satisfies the Lipschitz condition ||h(z + v) — h(z)|| < ||v||/ 
{c min (S) - k(p; A)}. 

6. More simulation results. In this section, we present simulation results 
along with some discussion on the performance of the LASSO, MC+ and 
SCAD+ in selection consistency and estimation of (3 and fx = X/3, sparse 
recovery, the computational complexity and the scalability of the PLUS 
algorithm, the choice of the tuning parameter 7, the estimation of the noise 
level a and the risk, and the sparse Riesz condition. 

6.1. Selection consistency. For the MC+, the tuning parameter 7 reg- 
ulates its computational complexity and bias level. We study its effects 
through three experiments, say experiments 1, 2 and 3, including cases where 
7 is smaller than the "smallest" theoretical value 1/(1 — maxj^f s \ x 'j x k\/ n ) 
with d* = 2 in (2.5) and A < A un i v . 

Experiment 1, summarized in Table 1 in Section 2, illustrates the superior 
selection accuracy of the MC+ for sparse (3, compared with the LASSO 
and SCAD+. Experiment 2, summarized in Table 3, shows the effects of 
the regularization parameter 7 on selection accuracy and computational 
complexity of the MC+. Experiment 3, summarized in Table 4, demonstrates 
the scalability of the PLUS algorithm for large p. The design matrix X 
has the same distribution in experiments 1 and 2. For each replication, we 
generate a 300 x 600 random matrix as the difference of two independent 
random matrices, the first with i.i.d. unit exponential entries and the second 
i.i.d. Xi entries. We normalize the 600 columns of this difference matrix to 
summation zero and Euclidean length y/n. We then sequentially sample 
groups of 10 vectors from this pool of normalized columns. For the mth 
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Table 3 

Performance of MC+ with different 7 in experiment 2 100 replications, n — 300, p = 200, 
d° = 30, p* =3/8, LASSO for 7 = 00 CS = I{A = A"}, SE,g = ||/3 - /3|| 2 , 
SE M =||X(3-/3)||, K = #(steps) 





7 


1.01 


1.4 


1.7 


2.0 


2.4 


2.7 


3.0 


5.0 


00 


Auniv 


CS 


0.81 


0.82 


0.66 


0.53 


0.34 


0.35 


0.27 


0.11 


0.00 


= 0.188 


SE/3 


0.136 


0.128 


0.265 


0.495 


0.729 


0.801 


0.817 


1.007 


1.420 




SE„ 


0.117 


0.112 


0.205 


0.358 


0.510 


0.564 


0.583 


0.761 


1.123 




k 


561 


98 


62 


47 


36 


33 


32 


32 


34 


A 


A 


0.195 


0.182 


0.175 


0.164 


0.164 


0.158 


0.169 


0.188 


0.201 


for maxCS 


CS 


0.83 


0.82 


0.75 


0.63 


0.46 


0.36 


0.27 


0.11 


0.02 




fc 


561 


98 


65 


57 


45 


40 


34 


32 


33 


A 


A 


0.182 


0.175 


0.153 


0.138 


0.120 


0.108 


0.101 


0.094 


0.050 


for minSE^ 


SE/3 


0.132 


0.117 


0.119 


0.124 


0.133 


0.140 


0.149 


0.255 


0.394 




k 


562 


98 


68 


64 


65 


67 


68 


47 


84 


A 


X 


0.182 


0.175 


0.153 


0.138 


0.120 


0.108 


0.101 


0.094 


0.050 


for minSE M 


SE„ 


0.115 


0.104 


0.106 


0.110 


0.117 


0.124 


0.130 


0.201 


0.278 




k 


562 


98 


68 


64 


65 


67 


68 


47 


84 



group, we sample from the remaining 610 — 10?n columns one member as 
x i0m-9 an d 9 more to maximize the absolute correlation |x^xio m _g|/n, j = 
10m — 8, ... , 10m. In experiment 3, X are generated in the same way for 
each replication with groups of size 50 from a pool of 6000 i.i.d. columns. In 
all the three experiments, (3j = ±/3* for j G A° and e ~ N(0,I n ). 

Strong effects of bias on selection accuracy is observed in all three ta- 
bles. In Table 1 where /3* « VTOAuniv) the selection accuracy of the LASSO 
clearly deteriorates as d° increases. In Tables 3 and 4, the unbiasedness 
criterion /3* > 7A un i v in (2.9) matches the best selection results well, with 
l-7A un i v < ft* < 2A un i v in Table 3 and 2A un iv < P* < 2.4A un i v in Table 4. In 
Table 1, 7A un i v /o" w 1/2 = P*/a, but slightly larger A yields the largest CS. 
Comparison between the results for A un i v and argmax^CS in all three ta- 
bles demonstrates that A un i v is a reasonable choice for variable selection with 
||xj|| 2 = n, especially when P* is near the minimum for accurate selection as 
in Tables 3 and 4. 

An interesting phenomenon exhibited in experiments 2 and 3 is that the 
observed selection accuracy CS is always decreasing in 7. Despite the com- 
putational complexity for small 7, the MC+ still recovers the true A° among 
so many parallelepipeds it traverses through. This suggests that the interfer- 
ence of the bias, not the complexity of the path or the lack of the convexity 
of the penalized loss, is a dominant factor in variable selection. Of course, 
bias reduction does not always provide accurate variable selection. When the 
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signal is reduced to /3* = 1/4 from /3* = 3/8 in experiment 2, the selection 
accuracy suddenly drops to CS < 0.11 for all values of (A, 7). 

6.2. Estimation of regression coefficients and the mean responses. Tables 
1, 3 and 4 also report results for the estimation of regression coefficients 
P with the square error SEp = - /3|| 2 . The MC+ and SCAD+ clearly 
outperform the LASSO in these settings. In Table 4, the minimum SE^ for 
the SCAD+ are 2.5% and 6.2% smaller than the MC+ with matching 7 = 2.4 
and 2.7, respectively, while those of the MC+ are 14% and 16% smaller than 
the SCAD+ with matching maximum concavity n(p) (7 = 1.4 and 1.7 for 
the MC+ versus 7 = 2.4 and 2.7 for the SCAD+, respectively). The SCAD 
penalty requires 7 > 2. The results for the SCAD+ in experiment 2 are 
not reported since they show a similar pattern as experiment 3. Results 
for the estimation of the mean fi = X/3 with the average squared error 
SE^ = 1 1 X/3 — X/3|| 2 /n are similar to those for the estimation of /3 in Tables 
3 and 4. 

6.3. Computational complexity and choice 0/7. As expected, we observe 
in Tables 3 and 4 that the MC+ with smaller 7 is computationally more 
costly. Dramatic rise in the number of needed PLUS steps is observed when 
7 decreases to 1/2 in experiment 2. We avoid 7 = 1, since it produces the 
singularity det(Q(ry)) = for (3.12) whenever X3^=i l 7 ?^ I = 1 f° r the MC+ 
with the standardization ||xj|| 2 = n. 



Table 4 

Performance of MC+ and SCAD with p> n in experiment 3 100 replications, n = 300, 
p = 2000, d° = 30, = 1/2, SCAD+ for 7*, LASSO with 7 = 00 





7 


1.4 


1.7 


2.0 


2.4 


2.7 


2.4* 


2.7* 


00 


Auniv 


CS 


0.99 


0.99 


0.96 


0.80 


0.56 


0.00 


0.00 


0.00 


= 0.225 


SE/3 


0.109 


0.116 


0.205 


0.534 


0.712 


2.703 


2.764 


2.640 




SE„ 


0.098 


0.103 


0.170 


0.395 


0.515 


1.602 


1.661 


1.785 




k 


119 


76 


62 


46 


41 


130 


84 


56 


A 


A 


0.241 


0.225 


0.225 


0.225 


0.210 


0.177 


0.171 




for maxCS 


CS 


1.00 


0.99 


0.96 


0.80 


0.60 


0.08 


0.02 


0.00 




k 


118 


76 


62 


46 


44 


255 


169 




A 


A 


0.225 


0.203 


0.183 


0.165 


0.149 


0.134 


0.129 


0.069 


for minSE^ 


SE/3 


0.109 


0.112 


0.117 


0.127 


0.138 


0.124 


0.130 


1.292 




k 


119 


77 


69 


71 


76 


279 


200 


181 


A 


A 


0.225 


0.203 


0.183 


0.165 


0.149 


0.143 


0.134 


0.069 


for minSE M 


SE„ 


0.098 


0.100 


0.104 


0.112 


0.122 


0.112 


0.118 


0.563 




k 


119 


77 


69 


71 


76 


273 


197 


181 
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T5 1 2 4 .5 1 2 4 .5 1 2 4 



Fig. 5. The median of a 2 (X)/(nX 2 / log p) as a function of A/^/ (log p)/n G [2~ 3/2 ,4] 
feased on -700 replications. Left: experiment 4 with n — 300, p = 2000 and d° — 30. 
Middle and right: experiment 5 with high and low correlations, respectively, n = 600, 
p = 3000 and d° = 35. For 1/1.5 < r < 2 in (5.14), 5 2 (A)/(nA 2 / logp) « l/r matches 
A/i/ (logp) /n = y'ro reasonably well to provide <? 2 (A) « a 2 = 1. 27iis is especially the case 
for ro = 1 as indicated by the dotted lines. 



Table 4 shows that the PLUS algorithm scales well for p> n. Comparisons 
between Tables 3 and 4 demonstrate that for similar d° and SNR /3*/A un i v , 
the computational complexity of the MC+ is insensitive to p as measured 
by the average number of steps k. 

In practice, full implementation of the MC+ requires a specification of 
7 and possibly a stopping rule for large (n,p), say k — ^max A k , to al- 
low the algorithm to end before it reaches the perfect fit at k = k* . As we 
have discussed in the Introduction, large 7 provides computational simplic- 
ity but may harm selection consistency with larger bias. Our simulation 
results in Tables 3 and 4 demonstrate robust selection accuracy for smaller- 
than-necessary 7 > at the universal penalty level. Thus, the choice of 7 
should largely be determined by the available computational resources as 
long as the MC+ path reaches a sufficiently small A. In our simulations, 



0.8 0.9 1.0 1.1 1.2 



0.9 1.0 1.1 1.2 



0.9 1.0 1.1 1.2 



Fig. 6. Histograms of a at ro = 1 for the same simulations as in Figure 5 with respective 
means and standard deviations 0.971 ± 0.057, 1.033 ± 0.032 and 1.060 ± 0.039 from the 
left to the right. It turns out that the MSE for a is of the same order as n" 1 ^ 2 in these 
simulations. 
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kmax = 5000, and all replications failing to reach A < A*/ 1.2 occur only for 
unreasonably small 7 = 1/2, where A* is (much) smaller than the smallest 
reported penalty level in each experiment. Since o in (5.13) is based on 
the beginning segments of the PLUS path, we "know" whether the desired 
penalty level is reached. 

6.4. Estimation of noise. In Figures 5 and 6, we present simulation re- 
sults for the estimation of a in experiments 4 and 5 with the MC+ estimator 
/2(A) = X/3(A). In experiment 4, (n,p) = (300,2000), 7 = 1.7, /3* = 1/2, (3 is 
generated every 10 replications and X is fixed. Its configurations are oth- 
erwise identical to that of experiment 3 reported in Table 4. In experiment 

5, (n,p) = (600,3000), x, are normalized columns from a Gaussian random 

\k— 7 1 

matrix with i.i.d. rows and the correlation Uj^ = o~[ 2 among entries within 
each row, 7 = 2/(1 — maxj>fc |x^Xj|/n) as in experiment 1, the nonzero (3j 
are composed of 5 blocks of /3* (1, 2, 3, 4, 3, 2, 1)' centered at random multiples 
ji,...,j5 of 25, /3* sets ||X/3|| 2 /n = 3, e~iV(0,I n ), and {X,/3} are gener- 
ated every 10 replications. It has two settings: 07^ = 0.9 for high correlation 
and 07^ = 0.1 for low correlation. We set A* = {2 _3 (logp)/n} 1 / 2 in both 
experiments 4 and 5. 

Figure 5 plots the median of <r 2 (A) / (n\ 2 / logp) versus A/ yf [\ogp)/n in the 
simulations described above. Since all three curves cross the level cr 2 (A) / (nA 2 / 
logp) = 1 at approximately A/ y 7 (logp)/n = 1, the estimation equation (5.14) 
provides approximately the right answer a 2 ~ 1 for tq = 1. We solve (5.14) 
for individual replications and plot the histograms of a in Figure 6. These 
simulation results suggest that the MSE for a is of the same order as n" 1 ^ 2 
for sparse (3. 

6.5. Sparse recovery. Our variable selection theorems are applicable to 
sparse recovery in the noiseless case of a = as we mentioned in Remark 

6. Table 5 reports simulation results to show that the LASSO (y = 00) may 
miss up to about 45% of nonzero /3j, while the MC+ (7 = 3) still manages 
to recover the true (3. For (n,p, d°\ ) = (100, 2000, 28) and (200,10,000,40), 
the LASSO does not capture most of the nonzero /3j before falsely selected 
variables manage to perfectly fit y = X/3 at the last step of the LARS, at 
the expense of substantially many additional computation steps. 

6.6. Estimation of risk. We summarize in Figure 7 the ^performance of 
C p in (5.9) for the MC+ in experiments 4 and 5, with the df in (5.10) and 
the a in (5.14). For each of the three settings, E'H/^A) — /x|| 2 and EC P (X) 
are approximated by the averages in 100 replications and the expected con- 
ditional variance E Var(C p (A)|X, (3) is approximated by the within-group 
variance, since (X,/3) is unchanged in every 10 replications in each of the 
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Fig. 7. Approximations of E\\p,(X) — /x|| 2 /n (solid) and 

EC p (\)/n ± 2{£ , Var(C p (A)/n|X,/3)} 1/2 (dashed) as functions of \/y/Qogp)/n for 
the MC+ based on the same simulations as in Figure 5. The MSE E\\(i(\) — /x|| 2 is 
reasonably approximated by C P (X) in these experiments with p > n, at least before the 
MC+ starts to over fit with small X. 

three settings. From Figure 7, we observe that the MSE E\\fi,(\) — /z|| 2 is 
reasonably approximated by C P (X) for p > n, at least before the MC+ starts 
to over fit with small A. 

6.7. The sparse Riesz condition. The SRC (2.11) and constant factors in 
Theorems 1, 2, 4 and 5 are quite conservative compared with our simulation 
results. Technically, this is probably due to the following two reasons: (i) 
the sparse minimum and maximum eigenvalues, or c* and c*, respectively, 
in (2.11), are used to bound the effects of matrix operations in the worst 
case scenario given the dimension/rank of the matrix; (ii) we use the con- 

Table 5 

Sparse recovery with MC+ at the last PLUS step k* . Entries o/X and nonzero (3j 
are i.i.d. #(0,1), e = 0; FN = #{j -.fif^ = / ft} 



(n,p,d°) (100,2000,15) (100,2000,28) (200,10,000,40) 



7 


3 


oo 


3 


00 


3 


00 


%{3 (fc * ) =/3}^ 


100 


51 


7:5 





100 





mean(FN |3 <fc ' / /3) 




2 


19 


13 




18 


^ (fc* ) 

mean(fc*|/3 = /3) 


32 


65 


87 




102 




mean(fc*|/3 7^/3) 




144 


513 


153 
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10 20 30 40 10 20 30 40 

dimension dimension 

Fig. 8. The mean (solid) of the minimum eigenvalue c m i n (X^XA/n) for a random set 
A of design vectors and the mean minus two standard deviations (dashed) as functions 
of the dimension \ A\, each point based on 100 replications, with horizontal dotted lines at 
n(p2) = 1/7 for 7 6 {1.4, 1.7, 2.652}. Left: the design X in experiments 1 and 2. Right: the 
design X in experiments 3 and 4- 

servative bound Cm^E j :7? ^o - diag(l/7, \r]j \ = 1)) > (^^(Sj.^.^o) - 1/7 
to ensure sparse convexity in the fcth segment r)^ k > of the MC+ path, but 
jf-{j : \r]j k ^\ = 1} could be much smaller than #{j :r]j 7^ 0}. These consid- 
erations suggest that the penalized loss (1.1) with the MCP (2.1) possesses 
sufficient convexity if 

(6.1) P*{c min (S A ) > k( P2 ) = lh\\A\ = d,X} w 1 

at a reasonable dimension d, where P* is the probability under which A is a 
random subset of {1, . . . ,p}. In practice, we may substitute the SRC (2.11) 
with (6.1) and a similar probabilistic upper bound on c max (5]A) under P* , 
which are weaker and much easier to check. Figure 8 plots the mean and a 
lower confidence bound of c m i n (5]A) under P* as functions of given d = \ A\. 
We observe that (6.1) holds for quite a few possible combinations of (d, 7) 
in our experiments, in view of Tables 1, 3 and 4. 

7. Discussion. We have introduced and studied the MC+ methodology 
for unbiased penalized selection. Our theoretical and simulation results have 
shown the superior selection accuracy of this method and the computational 
efficiency of the PLUS algorithm. We have provided an oracle inequality to 
demonstrate the advantage of the MC+ for the estimation of regression 
coefficients and proved its convergence at certain minimax rates in £ r balls. 
We have also discussed unbiased estimation of the risk, estimation of the 
noise level in the linear model in the case of p > n, and the necessary and 
sufficient conditions for the continuity of the penalized LSE. In this section, 
we briefly discuss the choice among multiple solutions in the PLUS path, 
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the one-at-a-time condition with the PLUS algorithm, the penalized LSE for 
orthogonal designs, adaptive penalty, general loss and sub-Gaussian errors. 

7.1. Choice among multiple solutions in the path. In (2.8), /3(A) is taken 

as the j3 when A^ first reaches a level no greater than A. An alterna- 

tive choice [Zhang (2007b)] is to pick /3(A) as the sparsest /3 in (2.7) with 
AW = A. Theorems 1, 4 and 5 holds verbatim for the sparsest solution, while 
Theorem 2 holds with a smaller <i* = <i*/(c*/c* + 3/2). Our simulation ex- 
periments yield nearly identical results among the two choices. A significant 
reason for using (2.8) is its simplicity in implementation since it does not 
require the entire path to compute /3(A) for given penalty levels A. 

7.2. The one-at-a-time condition with the PLUS algorithm. The formu- 
las (3.16)-(3.19) provide a simplified version of the PLUS algorithm dealing 
with the one-at-a-time scenario in which every intermediate turning point 
in the PLUS path is the intersection of exactly two line segments of posi- 
tive length. Although the one-at-a-time condition holds almost everywhere, 
numerical ties do occur in applications. When the one-at-a-time condition 
fails, the main branch (2.7) is a limit path of one-at-a-time paths, so that it 
is a graph with no dead end. The difference here is that when the PLUS path 
reaches a more-than-two-way intersection, say at step k, it must checked the 
indicators r]^\ < £ < k, to avoid infinite looping with the covered segments 
. The computational cost for checking the indicators is 0(k) if rj are effi- 
ciently coded, which is small compared with the cost 0{np) for finding the 
exit time (3.15). See Zhang (2007b) for details. 

7.3. Orthonormal designs and more discussion on penalties. For orthonor- 
mal designs x^x^/n = I{j = k}, the penalized estimation problem is reduced 
to the case of p = 1. For p(t; A) = X 2 p rn (t/X) with the quadratic spline penal- 
ties (3.1), 

(7.1) % = A6(x^y/(raA)) where b(z) = argmin{(z - bf /2 + p m {\b\)}. 

b 

For p = 1 and the MCP with k(p2) = I/7 < 1, the solution of (7.1) is 

b f (z) = sgn(z)min{|z|,7(|z| - A) + /(7 - 1)}, 

which turns out to be the firm threshold estimator of Gao and Bruce (1997). 
The firm threshold estimator is always between the soft threshold estimator 
b s (z) = sgn(z)(\z\ — A)+ and the hard threshold estimator bh{z) = zl{\z\ > 
A}. Actually, b s (z) < b{z) <bf(z) < b^z) for z > and the opposite inequal- 
ities hold for z < for all solutions of (7.1), given a fixed 7A in (2.3) or a 
fixed maximum concavity n(p m ) = 1/7 with 7 > 1. We plot these univariate 
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estimators in Figure 9 along with the univariate SCAD estimator. For p = 1 
and n{p2) = 1/7 > 1, the MC+ path (2.7) has three segments and (2.8), 
identical to the hard threshold estimator, globally minimizes the penalized 
loss. See Figure 9 on the left. Antoniadis and Fan (2001) observed that in 
the orthonormal case, the global minimizer (7.1) for the penalty (2.1) with 
7 = 1/2 yields the hard threshold estimator. In fact, in the univariate case, 
any penalty function with concave derivative p(t; A) and 7 < 1 in (2.3) yields 
the hard threshold estimator as the global minimizer in (7.1). 

The analytical and computational properties of penalized estimation and 
selection for general correlated X and concave penalty is much more com- 
plicated than the case of p = 1, since they are determined in many ways 
by the interplay between the penalty and the design. To a large extent, the 
effects of the penalty can be summarized by the threshold factor 7 for the 
unbiasedness in (2.3), the maximum concavity k(p;X) in (2.2) and their re- 
lationships to the correlations of the design vectors. This naturally leads to 
our choice of the MCP as the minimizer of n(p; A) given the threshold factor 
7 and the role of 7 = l/n(pi) as the regularization parameter for the bias 
and computational complexity of the MC+. 

7.4. Adaptive penalty. The PLUS algorithm applies to the penalized loss 

v 

(7.2) (2n)- 1 ||y-X/3|| 2 + ^A 2 p m (|/3i|r i /A), r, > Vj, 

i=i 

through the scale change {xj,/3j} — > {xjrj,(3j/rj}. It can be easily modified 
to accommodate different quadratic p m of the form (3.1) for different j. For 
example, different 7 = jj can be used with the MC+, so that the jth path 




1 2 z 1 2 z 



Fig. 9. Left: the univariate hard, soft and MC+ paths in z©6efl(Bfl r *=l 2 with a 
vertical dotted line at z — 7 = 1/2. Right: the hard, MC+ / firm and SCAD paths for p = 1 
with 7 = 5/2. Hard and soft path in solid, and additional segments of MC+ and SCAD in 
dashed lines. 
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(3j(\) reaches the unbiased region when \/3j(X)\rj/X > jj. This allows rj and 
7j to be data dependent. For rj = 1, the unbiasedness condition 7.,- A < \/3j\ 
allows a higher level of convexity than (2.9) does. 

Zou (2006) proposed an adaptive LASSO with X 2 pi(\/3j\rj / X) = Xrj\(3j\, 
where rj is a decreasing function of an initial estimate of /3j. The idea is 
to reduce the penalty level or the bias for large/nonzero \f3j\, but its ef- 
fectiveness for selection consistency essentially requires the initial estimator 
to be larger than a (possibly unspecified and random) threshold for most 
large/nonzero \f3j\ and smaller than the same threshold for most small/zero 
\/3j\. This approach was proven for bounded p = rank(X) to provide se- 
lection consistency (1.4) in Zou (2006) and Zou and Li (2008). Marginal 
regression x,y/||x,- 1| 2 can be used as an initial estimate of /?,• and is proved 
to result in the selection consistency of the adaptive LASSO under a certain 
partial orthogonality condition on the pairwise correlations among vectors 
{y,xi, . . . ,x p } [Huang, Ma and Zhang (2008)]. 

7.5. General loss functions. Consider the general penalized loss L(/3; X) = 
tp(fl) + Sj=i P(\Pj\'> A), where ip{0) = ^> n (/3;X,y) is a convex function of 
P E W given data (X,y). In generalized linear models, n^ n (/3;X,y) is the 
negative log-likelihood. With (ipj)pxi and (ipje) p x P being the gradient and 
Hessian of ip, (2.7) must satisfy 

(? 3) ( W {X) ) + ^0f)p^t ] \^ x) ) = 0, % a) + 0, 

\\h<p {x) )\<\to, ^ x) = o. 

Let s( x ) = dp (x) /dX^ and = {j: p& / 0}. Differentiation of (7.3) yields 

(7-4) { £ hil&SP) + P(\^ x) \;\ {x) )sf = "$ x) ^ (x) ) 

for j E and sf ] = for j $ A^ x \ where a{t; X) = - sga(t)(d/dX)p(\t\;X). 
This provides the local direction of the next move and thus allows an ex- 
tension of the PLUS algorithm. The main difference of such an extension 
from (3.8) is that the step size has to be small when ?/>(•) is not a quadratic 
spline. The main difference of such an extension from the computation of 
the LASSO for the generalized linear models [Genkin, Lewis and Madigan 
(2004), Zhao and Yu (2007) and Park and Hastie (2007)] is the possibility 
of the sign change dX^/dx to allow the path to traverse from one local 
minimum to another. Extensions of the LARS with large step size have 
been considered by Rosset and Zhu (2007) for support vector machine and 
by Zhang (2007a) for continuous generalized gradient descent. 
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7.6. Sub-Gaussian errors. Remark 3 in Section 2 mentions the validity 
of our theorems when the normality condition e ~ N(0, cr 2 I n .) in (1.2) is 
replaced by a sub-Gaussian condition on the error vector. Here, we provide 
some details. 

Proposition 3. Let e G M. n be a random vector satisfying the sub- 
Gaussian condition Eexp (x'e) < e ll x ll 2,T ?/ 2 for all xeR". Then for projec- 
tions P of rank m 

pK > l + * rrr) < e" m * /2 (l + xT' 2 Vx > 0. 

The normality condition is used in our proofs only to provide upper 
bounds for the tail probabilities of u'e and ||Pe|| 2 /m. The sub-Gaussian con- 
dition implies P{a!e/ai >t}< e"* 2/2 < (t + l/t)$(-t) for t > and ||u|| = 1, 
comparable to the normal tail probability. Proposition 3 is comparable to 
the Xm/ m tail probability bound needed in our proofs. 

APPENDIX 

In this appendix, we provide all the proofs. Theorem 1 is a special case of 
Theorem 5 and Theorem 2 concerns estimation in the same special case. The 
proof of Theorem 5 requires Theorem 6 and the proof of Theorem 7 requires 
Theorem 9 and Proposition 2. Thus, the proofs are given in the following 
order: Theorems 3, 4, 6, 5, 1, 2 and 9, Proposition 2, Theorems 7 and 8 and 
then Proposition 3. Two lemmas, needed in the proof of Theorems 6, 5 and 
2, are stated before the proof of Theorem 6 and proved at the end of the 
Appendix. 

Proof of Theorem 3. Let X be fixed. Define dk{r}) = #{j ■ \rjj\ = k}, 
k = 1,2. We consider three types of indicators G {— 2, — 1, 0, 1, 2} p with 
77 = as type-1. 

Type-2: cfefa) >nAp. Let (rz) b G S(rj) as in (3.5), so that (3.3) holds 
with Zj = TZj = Tx'jy/n. Since P2(\bj\) = for \rjj\ = 2, (3.3) implies x^-(ry — 
Xb) = for all \rjj\ = 2. Since ry — Xb G R n and {xj, \rjj\ = 2} contains at 
least n Ap linearly independent vectors, by (3.6) 

(A.l) lf 2 ^L-K nA /,^ =► X'(ry-Xb) = 0. 

v ; \ (rz) b G £(r}\z) v J ' 

Type-3: di(j)) < nAp and r] 7^ 0. If (3.3) holds for z = 0, then ftjX^-Xb/n = 
bj'x! j 'b = -\b j \p2Qb j \) for all b^O, so that ||Xb|| 2 /n = - £\ |^|p 2 (|^|) = 
due to p2(\bj\) > 0. Since p2(\bj\) = (1 — \bj\/j) + > for \bj\ < 7 and \bj\ < 7 
for |?7j| = 1, bj equals either or jrjj for \r}j\ = 1 in such cases. Therefore, 
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xb = S| % |=2^' x i +7l]fe fe ^o,|r ?fc |<2 r /fc x fc = °- Tnis is impossible for nonde- 
generate X since 7 > and d^ifj) <nAp. Thus, © b ^ S(rj) for indicators 
rj of type-3. 

We now consider the choice of 7 for the MCP. It follows from (3.2) and 
(3.11) that the determinant det(Q(r/)) is a polynomial of v\ = I/7 with 
det(Sj. |_ 2 )(— vi) dl ^ as the leading term and det(S^. 1^1=2) / f° r type- 
3 77 by (3.20). Let To(X) be the finite set of all reciprocals of the real roots 
of such polynomials with type-3 77. We choose 7 ^ To(X) hereafter, so that 
det(Q(T/)) / for all rj of type-3. Since det(Q(r])) / 0, in S(t]) the vector 
(bj,rjj 7^ 0)' is a linear function of z by (3.12), so that by (3.6) and the 
discussion in the previous paragraph 

(d (v)<nAp (det(Q(77))/0, 
(A.2) W { i(r,\z) is a generalized line segment, 

l 77 ^ ' (offib^^z) Vb. 

Here, a generalized line segment includes the empty set, single points in 
H © H* = ]R 2p , and line segments of finite or infinite length. 

For each nonzero z £ H = W, we define a graph G(z) with l(r]\z) of 
positive length and type-3 r] as edges and the end points of edges as vertices. 
The graph G(z) is not necessarily connected. A vertex in G(z) is terminal 
if it is also a boundary point of S(rj) for some rj of type-2. If the PLUS 
path reaches a terminal vertex (rz) © b, then b/r provides an optimal fit 
by (A.l). The degree of a vertex in G(z) is the number of edges connected 
to it. 

Suppose z / 0. At step k = 0, the MC+ path reaches (r (0) z) © b(°) as 
a boundary point of S(0). Since the p-parallelepipeds (3.5) are contiguous, 
{t^z) © b^ ) is also a boundary point of S(rj^) for some r/ 1 ) satisfying 
either (A.l) or (A.2) with z = z. If T7W is of type-2, then b(°) /t(°) gives an 
optimal fit and the MC+ path ends with k* = 0. Otherwise, the MC+ path 
enters the graph G(z) at the initial vertex (r^z) © b*- ^ If the degree of 
the initial vertex is odd and the degrees of all other nonterminal vertices are 
even, then the MC+ path traverses through G(z) and eventually reaches a 
terminal vertex in one pass. This is simply an Euler's Konigsberg problem. 

Let So be the union of all intersections of three or more distinct p- 
parallelepipeds S(r/), rj £ {-2, -1,0, 1,2} P , and H = {z: (rz) © b G S for 
some r and b}. Since the interiors of the p-parallelepipeds S(rj) do not in- 
tersect, the intersections of three distinct S(rf) are (p — 2)-parallelepipeds, 
so that the projection of So to the (p — l)-sphere {z : ||z|| = 1} along the rays 
{tz, t > 0} has Haar measure zero. Consequently, Ho has Lebesgue measure 
zero in H = MP. 

For z ^ Ho, each vertex in G(z) is a boundary point of exactly two p-par- 
allelepipeds S(rj), so that the initial vertex has degree 1 and other nonter- 
minal vertices have degree 2 in G(z). Thus, the initial vertex is connected 
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to a terminal vertex in G(z) in a unique way for z ^ Hq, and the conclusions 
of part (i) holds by (A.2). 

For z S Hq, the initial vertex is still connected to at lease one terminal 
vertex in G(z) since Hq is dense in H = M. p , and the limits of G(z) as z — > z 
are subgraphs of G(z). Hence, the conclusion of part (i) hold in either cases. 

Parts (ii) and (hi) hold since det(Q(?7)) 7^ almost everywhere for fixed 7 
and type-3 ij. For part (iv), we consider z ^ Hq. We have already proved the 
uniqueness of the graph and that the path ends with perfect fit at a type-2 
77. Since the vertex (r^ -1 ^) © h>( fe-1 ) is a boundary point of exactly two 
p-parallelepipeds S^^ -1 )) and S(ri^), in (3.9) uniquely indicates 

the side of the intersection. Since the edges must pass through the interior of 
the p-parallelepipeds, rf k > and are given by (3.10) and (3.13). The slope 
is uniquely determined by (3.12) due to det(Q(rf^)) / 0. The hitting 
time Aj in (3.14) is computed from the current position (r^" 1 ^) ©b^ fc_1 \ 
the slope and the inequalities for the boundary of the p-parallelepiped 
( 5(^( fc )) i n (3.5). Since the length of the edge is positive, A^ > in (3.15). 
Since the path does not return to 0, r^) > in (3.15). 

For part (v), (z © (3^ ^)/A is in the interior of S(r}^) at A = for 
k — 1 < x < k, so that (3.3) and thus (2.6) hold with strict inequality. By 
(3.11), 

(d/dt)L0 (x) +tb;X) 

= tb' 1 Q(r 7 «)b 1 + £ |^|{A-sgn(6,)x;.(y-X3 W )/^ + OW} 
,f=o 

is positive for small t > 0, where bi = (bj,i]j k ^ ^ 0)'. Thus, /3 ^ is a local 
minimizer. □ 

PROOF of Theorem 4. Since (3° is the oracle LSE, x^(y - X/3°) = 
for j G A°. If |i9? I > A7, then p(|/3?|; A) = by (2.3). Thus, 3° is a solution 
of (2.6) and sgn(/3 ) = sgn(/3) for all Ai < A < A2 in the intersection of 

fl?(Ai) = (max|x,-(y - x3°)/n| < Ai), 

( A -3) 




i S gn(/3 j )^> 7 A 2 }. 



Moreover, since the solution of (2.6) is unique, f3 = /3(A) for all Ai < A < A2 
in this case. 

Let P° be the orthogonal projection from R n to the linear span of {x,-, j € 
A°}. Since y — X/3 = (I„ — Pf)e, x j(y ~~ X/3 ) jn are normal variables with 
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zero mean and variance bounded by <r 2 ||xj || 2 /n 2 , so that 1 — P{Q°(Xi)} < 
7r n ,i(Ai). By (2.10) and (4.1), /3° ~ N(/3 j ,a 2 w° /n) for all j £ A°. Since 
> P* > 7A2, we have 1 - P{fi£(A)} < 7r ni2 (A 2 ). Inequality (4.3) follows 
by combining the above two probability bounds. □ 

Let us state the two lemmas. For m > 1 and B C {1, . . . ,p}, define semi- 
norms 

(A.4) C(v; m, B) = max{ ~ ^f 2 HI :B CAc{l,...,p},\A\=m + \B\ 
for v 6 R n , where is the orthogonal projection from W 1 to the span of 

{XjijG DI- 
LEMMA 1. Suppose (2.11) holds for X w;it/i certain d* and c* > c* > 
k > 0. Xei iT* be as in (4-5) with an a £ (0,1), and B C {l,...,p} with 
\B\ < d*/(K* + 1). Let A > be fixed and p(t;X) be a penalty satisfying 
A(l - Kt/X)+ < p(t;X) < X for all t > 0. Let 1 < m < m* = d* - \B\ and 
y 6 1" with (\/c*/a)C(y; m > B) < X, where ((•;m,B) is as in (A.4)- Let 
A©3 be a solution of (2.6), BU {j :0j 0} C A x C BU{j : |x£(y-X3)/n| = 
p(\/3j\;X)} and 3° be as in (2.10). If \Ai\ = \B\ + m, then 

(A.5) \A X \ - \B\ <K^p 2 {\Pj\-X)/X 2 <K*\B\. 

jeB 



If X> (V&/a)C(y,m*,B), then #{j ^ B : |x^(y — X/3)/n| = p(\fij\;\)} < 
IV (K*\B\) and 

c||3-3°ii< v^>l|x(3-3°)ll 

(A.6) 

r ^ 1 1/2 , 

< j ^P 2 (l/?il;A) [ + aA v / i^|£|c*/c*. 

Lemma 2. Le£ £(v;m, .B) be as in (A.4) with deterministic m and B. 
Let p € > sfe. be the solution of (2.12) with d° = \B\. Suppose e ~ N(0,o~ 2 I n ). 
Then 

,2 1 



(A.7) P{C(e;m,£) > <ry/(2/n) logpj < — / / !^L /i) < -^L < 4=, 

Vlogp, V lo gP<: ^2 



where p = {21ogp e — 1 + l/m}y/m/ y / 2logp e . 

Proof of Theorem 6. Let = #{j : j e 5 or |x£(y - X./3 ix) )/n\ 
p(\Pj X) \;\( x) )} and xi = inf {x > : X& < Ai or A^ < (Vc*/a)C(y; m, £))} 
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with the A (x) © 3 in (2-7) and m = d* - \B\. We first prove d^ < d* for 
< x < x\. Let A\ be any set satisfying 

BU{j-W^0} 

(A.8) 

C A< x) CBU{i: |x^(y - ^ {x) )/n\ = p0*\ A«)}. 

By (2.6), the left-hand side is always a subset of the right-hand side in (A.8). 
Moreover, since (3 is continuous in x, sgn((3j X ^) = sgn(/3^) = sgn(/3^ x+ ^) 

fails to hold only if j3 [x) = and |x£ (y - x3 (2 ° ) /n| = p(0; A^ ) = A^ , so that 

we are allowed to add variables to |^4^| one-at-a-time. Thus, since (3^ = 0, 
if d[ X2) > d* for some < X2 < x±, there must be a choice of with 

r x \ 

\A\ | = (i* and < x < X2- On the other hand, it follows from Lemma 1 that 

\& > Ai \/{(Vd*/a)((y;m*,B)} and \B\ < \A[ x) \=d* imply \ a[ x) \ < {K* + 

l)|i?| < d*, where m* = m. Thus, \B\ < \A[ x) \ =d* can never be attained for 

< x < xx. It follows that #{j iB:fif ] / 0} < - |B| < 1 V (K m \B\) 

for all < x < x\ by Lemma 1. 

Let A 4 = (Ty/(2/n) logp £ + e B /y/rn. By (2.8), A © 3 = A (x) © 3 (20 with 
a certain A^ > Ai V (A4 \/c*/a), so that the left-hand side of (4.8) is no 
greater thanP{0|} with fi 4 = {C(y; m,-B) < A 4 }. Since C(X/3; m, B) < ||(I n - 

P B )X/3||/vW by (A.4) and Q B = \\X3 - XEp°\\ = ||(I n - P B )X/3|| by 
(2. 10), ((y,m, B) < ((e;m,B) + B /^i. Thus, P{0|} < P{((e;m,B) > 
0\J (2/n) logpe}- The conclusion follows from Lemma 2. □ 

Proof of Theorem 5. Consider the event £1 = (~f j=1 ttj(Xj), where 
Q,j(Xj),j = 1,2, are as in (A. 3) and ^3^3) = {((e; m, A°) < A3}. It follows 
from the proof of Theorem 4 that A © 3 is a solution of (2.6) for all Ai < 
A < A2. Since n(p; A2) < k < c*, the sparse convex condition (2.5) holds with 

rank d* , so that Affi/3 is the unique solution of (2.6) subject to Ai < A < A2 
and + > 0} < d* . Since ((y; m, B) = £(e; m, A°) with B = A° 

in (A.4), we also have A2 > (\/c*/a0^3 > (y/cF /a)((y;m, B) in f2. 

In the event f2, consider the path X^ ®3^ ^ with < x < x\ = inf{x : A^ < 
Ai}. Let A [x) be as in (A.8). If \A {x) \ = d* and A (x) > A 2 , then Lemma 1 pro- 
vide < d*. If \A (x) \ = d* and Ai < A^ < A 2 , then the uniqueness of 
A©3° implies p (x) = 3°. Since |x£(y-Xj3°)/n| < Ai < p(M (x) ) for j £ A°, 

(x) (x) 

we have ^ = ^4°. Thus, |^4^ | = d* can never be attained with < x < x\ 1 
and /3(A) = 3 for all Ai < A < A2 in the event f2. 
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We still need to bound 1 — P{£1}. The proof of Theorem 4 provides 
1 - P{nj(Xj)} < iTnj(Xj) for j = 1,2. By (A.4), ((e;m*,A°) is the maxi- 
mum of ) Xm variables, so that 1 — Pjf^Aa)} < ^^(As). Thus, (4.6) 
holds. Finally, (4.7) follows from (4.6) with applications of the inequality 
e t2 l 2 ${-t) < min{l/2, l/(tV2vf)} and Lemma 2. □ 

Proof of Theorem 1 . Theor em 1 follows from Theorem 5 with a = 
1/2, since j = 1/k> c~ l y/A + c*/c* implies + 1 < c* /c* + 1/2 in (4.5) as 
in Remark 5. □ 

Proof of Theorem 2. As in the proof of Theorem 1, we have < 
c*/c* - 1/2 in (4.5) with a = 1/2. Let m = m* = d*-d°> (c*/c* - l/2)d° > 
d°/2. As in the proof of Theorem 6, for the A in part (i), Lemma 2 gives 
P{2^/7j;C(y; m, B) > A} < 6/^/4 log p £ . Thus, (2.16) follows from (A.6). It 
remains to prove (2.17) with p\ in (2.16). 

We first bound p\. Since m\ > (m/e) m and m>d° = R r /A^ m , by (2.15) 

< 2.o g (f ) £ 2,o E (^) .„^ + Ho E (-^). 

Thus, by (2.12), 0-^/(2/71) log pi < A mm + eio/y/n for large nA mm /cj 2 . 

Let P £ 6r,i? and B k be the set of j for the d° largest \f3j\ with j ^ BqU 
• • • U -Bfc_i, A; > 1, with £?o = - Let B = B\ and t>j = |/3j| A A mm . Since \/3j\ < 
WvB^Wt/d for j G 5 fc and A; > 2, £ fe > 2 PbJI/v 7 ^ < Ek>2 l|vB fc _ a 111/ 
d° = ||v||i/d° < R'X^/d = A mm . Thus, 6 B = || (I n - P B )X0||/v^ < 
Efc>2ll x B fc i9B h ll/v^< V^Anam by (2.11). Since c*d° < 2c*m, 9 B /y/TE + 
ay / (2/n)logpi < (\/2c7+ l)A mm + tia/y/n = A/(2\/c*), so that 

(A.9) sup P{c*||3(A) -3°|| > (S/2)\Vd?} 

by (2.16). Since A = 2\fc*\ mm {\ + \/2c* + o(l)), by the Holder inequal- 
ity, (A.9) and (A.5) imply that ||3(A) - 3°ll<? < l^i| 1_<?/2 ||3(A) - P°\\ q < 
Af^Ammd with large probability. Moreover, we have ||/3 — E(3 || 2 < 
P {l)d°cj 2 /(nc*) = o P (\ 2 mm d°/c*) and \\Ep° B - P B \\ 2 = WH^Hb^Pb'W 2 < 
(c*/c*)(Z k > 2 \\P Bk \\) 2 < (c*/c*)\l m d°^ Since \\P B 4\ < R r \\P B 4^ r < 
^mmd°, these inequalities imply that ||/3 — P\\\ = ||/3 B — P B \\q + \\P B c\\q < 
\B\ l - q l 2 {{o P (1/c ) + c*/c ) A mm rf°} 9/2 + Ammd° < Ml q \ q mm d° with large prob- 
ability. We obtain (2.17) by combining the upper bounds for ||/3(A) — P ||<j 
and \\P°-P\\l □ 

Proof of Theorem 9. (ii) =>■ (iii): let A be fixed and A = p(0+;A). 
Define h(t) = K(p;X)t 2 /2 + p(\t\;X) — \o\t\. Since k(p;X) is the maximum 
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concavity in (2.2), h(\t\) is a continuously differentiable convex function in 
M. It follows that the penalized loss 

L(b; A) = |i-||y- Xbf - ^||b|| 2 } + X>N + h(\ bj \)} 
1 J j=i 

is a sum of two convex functions, with the first one being strictly convex for 
Cmin(S) > n(p; A) and the second one being strictly convex otherwise. 

(iii) =^ (i): since the penalized loss L(b; A) is ||y|| 2 /(2n) for b = 0, y — > (3 
maps bounded sets of y in R n to bounded sets of (3 in MP. Since L(h; A) is 
continuous in both y and b and strictly convex in b for each y, its global 
minimum is unique and continuous in y. 

(i) (ii): since (3 depends on y only through z = 'X.'y/n and X is of rank 
p, the map z — > (3 is continuous from MP to its range J" . Since (3 is the global 
minimum, (2.6) must hold and the inverse (3 — > z = + sgn(/3)p(|/3|; A) is 
continuous for (3 E (0, oo) p n J^, with per component application of functions 
and the product operation. It follows that (0, oo) p n / is open and does not 
have a boundary point in (0, oo) p . Let 1 = (1, . . . , 1)' G M p . For z = xl^l 
with x > 0, L(xl;X) = o(x 2 ) for the ordinary LSE xl by the first condition 
of (5.15), and L(b;A) is at least c m ; n (X)x 2 for any b outside (0, oo) p . Thus, 
(0, oo) p n Jf is not empty. As the only nonempty set without any boundary 
point in (0, oo) p , (0, oo) p n J* = (0, oo) p . Moreover, the map z — > (3 is one-to- 
one for (3 E (0,oo) p . 

We have proved that all points (3 in (0,oo) p are unique global minimum 
of (1.1) for some z E MP. Let (3 = xl E (0, oo) p and b be the eigenvector with 
Xlb = c m i n (5])b and ||b|| = 1. The quantity 

t- l -^L0 + tb;X) 
p 

(A.10) = ||Xb|| 2 + Y,t- 1 sMPjMf>(\fr +tbj\;X) -p(\Pj\;X)} 



= c min (s) + t x W(x + % A ) - pip; A)} 

i=i 

must have nonnegative lower limit as t — > 0+. Integrating over x E [^1,^2] 
and then taking the limit, we find 

cmin(£)(t 2 - *i) + Pfc, A) - p(h; A) 

(A.ll) 

rl2 



f 2 5 

lim / — L(xl +tb: A) > 0. 

t^o+J h dt 



(A.12) 
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It remains to prove that (A. 11) holds with strict inequality. If (A. 11) holds 
with equality for certain < t\ <t2, then for t\ < x < ti and small t (A. 10) 
becomes 

g P 
t' 1 —L{(5 + tb; A) = c min (S) + ^M-CmintS)^} = 0. 

This is contradictory to the uniqueness of p\ □ 

PROOF of Proposition 2. Let P be as in Theorem 7. We write (2.6) 

as 

fPE3 + Psgn(3)p(|j3|;A)=Pz, 
\|%-xpc3/n|<A,Vi. 

Let r\ € {—1,0, 1} P be fixed (not confused with the rj in Section 3). It follows 
from Theorem 9 that the map Pz — > P/3 is continuous in z G W and con- 
tinuously invertible given a fixed sgn(/3) = rj. Let H{rf) = {z:sgn(/3) = 77}. 
The boundary of H{rj) has zero Lebesgue measure, since it is contained 
in the set of z satisfying r)j(3j = 0+ for r/j ^0 or zj — x^X/3/n = ±A for 
Vj = 0> J = according to (A.12). In the interior of H^ij), (A.12) 

gives (d/dzj)f3 = and (d/dz)(3j = for rjj = and 

P i(Pi)' = PSP + Pdiag(p(|&|; A))P' = Q(3; A). 

Since (2.5) holds with d* = p, c min (Q(/3; A)) > c m \ n (S) - n(p;X) > for 
all j3^0. Thus, the differentiation of the inverse map yields (d/dz)f3 = 

p / Q~ 1 (3;A)P. □ 

Proof of Theorem 7. It follows from Proposition 2 that (3 - £ -1 z is 
almost differentiable in z with derivative 

JL(3 - s- 1 ?)' = p'q- 1 (3; A)P - S" 1 . 

az 

Since z = X'y/n ~ iV(£/3, Ecr 2 /n), this and (5.3) imply 

e0 - s^ 1 ^)(s~ 1 i - py = e0 - s- x i)(i - sp-ysr 1 

2 

= — {SP / Q- 1 (3;A)P-S- 1 }. 

n 

Since the ordinary LSE is 3 = £ _1 z ~ N(J3, £ _1 cr 2 /ra), it follows that 
E(fl-0)(p-0)' 
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= E0 - 0)0 - 0)' - E(0 - 3)(/3 - 0)' + 2E0 -0)0- 0)' 

= E0 - 0)0 - 0)' + —{EP'Q~ l 0; A)P - 5T 1 } + —XT 1 . 

n n 

This proves (5.5). The rest of the theorem follows immediately. □ 
Proof of Theorem 8. Since trace(bb') = ||b|| 2 , (5.5) gives 
E\\fi - /t|| 2 = £ j H/2 - /i|| 2 + ^- trace(XP / Q~ 1 (3; A)PX')| 

2 

- — trace(X£ _1 X') 

n 

= E{\\p, - /I|| 2 + 2cr 2 df - a 2 rank(X)}, 



which implies (5.11) via (5.8). For (5.12), we observe that Q0; A) = PEP' 
by (5.4) when p{\%\\ A) = for all ^ / 0. □ 

Proof of Proposition 3. Let ui, . . . , Ujv be vectors in the unit sphere 
gm-i Q £ fo e range of P such that balls {v:||v — Uj|| < ei} are disjoint 
and Uj=i{v: ||v — u.j|| < 2ei} D S m ~ l . Volume comparison yields Nef" < 
(1 + ei) m - (1 - ei) m . Since v'e = u'e + (v - u)'e, ||Pe|| = max v65 m-i v'e < 
maxj<jvu^e + 2ei||Pe|| < maxj<Ar u^e/(l — 2ei)+. It follows that P{||Pe|| > 

< (1 + l/ei) m e-^~ 2ei ^ t2 / 2 . Taking i 2 = m(l + a?)/(l - 2ei) 2 , we find 

p{ ||Pe|| 2 /a 2 > ^2^2 } ^ U + l/ei) m e" m(1+a:)/2 < e~ mx ' 2 (l + rr) m / 2 

for (1 + 1/ei) 2 = (1 + x)e x . This proves the proposition since e\ = l/{e x / 2 x 
y/T+X-1). □ 

Proof of Lemma 1. Let Xi = X J 4 1 as in (2.4) and En ^X^Xi/ra. 
Since |Ai| < cf , 

(A.13) c,<^H^< c *, I<El^<1 vo^vgrI^i, 

||v|| c* ||v|| c* 

by (2.11). Set A 2 = {1, . . . ,jp} \ A u A 3 = B and A4 = Ay \ B. Define b k = 
(bj,j G for bGl p and fc = 1,2,3,4. For k = 3,4, let Q k be the matrix 
representing the selection of variables in A k from A%, defined as Qfcbi = b&. 
Let be the oracle LSE in (2.10) and e = y — X/3 = (I n — Ps)y. Since 
= P2 = 0, the components of the negative gradient 

(A.14) g = X'(y-X3)/n 
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must satisfy gi = X' x (y — Xi/3 1 )/n = X^e/n + Sn(/3 1 — fli), so that 
(A.15) E^gi + (3i - 3i) = S^Xie/n. 

Let Vl = S n 1/2 gl and v fc = £ n 1/2 Q' fcgfc , fe = 3,4. Let P x = XiSi 1 1 X' 1 /n = 
P^j be the projection to the range of Xi as in (A. 4). Since A\ D -B, Pi£ = 
(Pi-P B )y, so that llPiif/n < |A 4 |C 2 (y; |A 4 |,£) by (A.4). Thus, for A> 
(\/c*/a;)£(y; | ^4 1 , as provided, 

(A.16) gJbQfcEi/Xie/n^ ||v fc ||||Pii||/v^< ||v fc ||aAV|^4|/c*. 

Since Q4(/3 1 — = f3 4 — /3 4 = — /3 4 and V3 = vi — v 4 , by (A.15) we have 

l|v 4 || 2 - ||v 3 || 2 + || V! || 2 = 2v 4 V! = 2g 4 Q 4 £ n 1 g 1 = 2g 4 Q 4 S n 1 X' 1 ?/n- 2g 4 3 4 . 

Since 2||v 4 ||A v / | A A \/c* < ||v 4 || 2 + A 2 |A 4 |/c*, the above identity and (A.16) 
yield 

(1 - a)||v 4 || 2 + || Vl || 2 + 2g 4 3 4 < ||v 3 || 2 + a\ 2 \A±\/c*. 
Similarly, it follows from (A.15) and (A.16) that 

I|v4ii 2 + 2 g4 34 + l|£i{ 2 (3i-3i)ll 2 

= ||v 4 || 2 + 2g 4 3 4 + || Vl || 2 - 2g' 1 £ n 1 X / 1 i/n + ||Pii|| 2 /n 

= ||v 3 || 2 + 2g 4 Q 4 S n 1 X / 1 i/ ? i - 2g , 1 £ n 1 X / 1 £/7i + ||Pii|| 2 /n 

= l|v 3 || 2 - 2g / 3 Q 3 £ n 1 X / 1 £/7i + ||Pii|| 2 /n 

<||v 3 || 2 + 2||v 3 ||aAVlM^+ « 2 A 2 |A 4 |/c* 

due to gi = g^Q 3 + g 4 Q 4 . For the w = (2 - a)/{c*c*/K 2 - 1) in (4.5), the 
{1,^} weighted sum of the above two inequalities yields 

LHS= (l-a + uO||v 4 ||!+ ||vi|| 2 + (l+u»)2g 4 3 4 
+ W ||^/ 2 (3 1 -3i)H 2 

(A.17) 

< (l + ™)||v 3 || 2 + (a + m* 2 )A 2 |A 4 |/c* 
+ 2w;||v 3 ||aAv / |^4|/c*. 

Note that (A.17) holds with equality only in the following scenario: ||v 4 || 2 = 

A 2 |A 4 |/c* and (A.16) holds with equalities for both v 3 and v 4 . Since \A±\ = 
1/2 

I Ai| — \B\ > and S X1 = QkSk have different support for k G {3,4}, this 
scenario could happen only if ||v 3 || = 0. Thus, (A.17) holds strictly unless 
l|v 3 || = ||g 3 || = 0. 
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We first bound the LHS. Since A(l - «|t|/A)+ < p(\t\;X) < A, (2.6) and 
(A. 14) provide 

I|g4|| 2 v pfc>r/i jSlV 

(A.18) 

Hsall 2 < p 2 (|^|;A) 



+ 



Z-^ \2 



A 2 " ^ A 2 

in view of the second condition on A± = A4U B. We also have fi° = and 

^ ^ —1/2 

/3j<7j = |/3j<7j| for j S A4. Thus, by (A. 13) and the definition = £ n Q' k gk, 
LHS = (1 - a + w)\\v 4 \\ 2 2 + || Vl || 2 + (1 + w)2g^3 4 

+ W ||s}{ 2 (3 1 -3i)ii 2 

>(l-a + W )||g4||i/c* + || gl || 2 /c* 
+ (l + u;)2gl3 4 + u; C ,||3 1 -3°|| 2 



(A.19) 



> A 2 {( 2 " ol + w)(l - Ktj) 2 + /c* 

He II 2 

+ (1 + tu)2(l - Ktj) + tj + WC*t 2 } + " 



c 

2 /„* 



> A |^4 1 min {(2-a + u;)(l-Kt)7c 

0<ftt<l 

lie II 2 

+ (1 + w)2t(l - Kt) + wc*t 2 } + ^r-» 

where ij = |/3j|/A. Since c* > c* > k and 10 = (2 — a)/(c*c* /k 2 — 1), we have 
(2 - a + w)k 2 /c* - (1 + w)2k + wc t 
= 2{wc* — re(l + w)} 
_ 2 (2 - a)c* - k(c*c*/k? + 1 - a) < 

C*C* / K 2 — 1 

due to — c*a < and — c*c*/k 2 — 1 + 2c*/k < —(c*/k — l) 2 . Thus, the 
minimum in (A.19) is taken over a concave quadratic function with equal 
value at {0, 1/k}, so that 

(A.20) LHS > A 2 |A 4 |(2 - a + w)/c* + ||g 3 || 2 /c*. 



Inserting (A.20) into (A. 17), we find 
A 2 |A 4 |{2 - a + w - (a + ■wa 2 )}/c* 
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< (1 + w)\\v 3 \\ 2 - \\g 3 \\ 2 /c* + 2H|v 3 ||aAVl^4|/c* 

< (l + w )||v 3 || 2 - \\g 3 \\ 2 /c* +wa( }^\ +t{l-a)\ 2 \A i \/c* 



t(l - a) 

and that the strict inequality holds unless 1 1 V3 1 1 = ||g3|| =0. We move wat(l — 
a) X 2 |Ai I /c* to the left-hand side and then multiply both sides by c*/X 2 to 
arrive at 

{2 + w(l + a) - twa}{\ - a)\A A \ 

(A.21) < (1 + w{l + (a/t)/(l - a)})c*||v 3 || 2 /A 2 - \\gs\\ 2 /X 2 

< { (i + w{1 + {a/t )/(l - a)})c*/c* - l}||g 3 || 2 /A 2 

due to c*||v 3 || 2 < ||g 3 || 2 by (A. 13). The strict inequality holds above, since 
the equality would imply ||g 3 || = and then \A±\ = 0. This proves (A. 5) via 
(A.18). 

For (A.6), it follows from (A. 15), 0' 4 - pl)'g 4 > and then (A. 13) and 
(A. 16) that 

= -(3i - 3i)'gi + (3i - 3i) / X' 1 £/n 
<||3 3 -3;illlg3|| + l|Xi(3 1 -ffl||||Pie||/n 

< i|s^ 2 (3 1 - 3i)lllls3ll/VHr-K ns^ 2 ^! - 3i)|| c ,aVI^Ii7^- 

Dividing both sides by ||5jJ{ 2 (/3 1 — /3i)||, we find with another application 
of (A. 13) that 

c*||3 -3°|| < V^\\^n 2 0i ~ 3i)ll < ||g 3 || + aX^lA^/c*. 

Since ||X(3 - 3°)||/ v /n = ll^i{ 2 (3 - 3°)||, this proves (A.6) via (A.18) and 
(A.5). □ 

Proof of Lemma 2. Since m and B are deterministic, nm( 2 (£;m, B) /a 2 
in (A. 4) is the maximum of ( p ~ d ) variables with the Xm distribution, so that 

(A.22) P{((e; m, B) > a^J{2/n) logpj < ( P ~J° \ P{ X 2 m > m(l + x)} 

with x = 2 logp e — 1 > 0. Since Xm/ (1 + x ) has the gamma(m/2, (1 + x)/2) 
distribution, 

P{xl>m(l+x)} 

(A.23) 

m(l+x)/2 (i I „\m/2 roo 

= e ^i + a;; / m / 2 -i -(i+a;)(t-m)/2 dt 

T(m/2)2 m /2 
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Let y = y/t and h(y) = {l + x){y 2 — m)/2 — (m — 1) logy. Since (d/dy) 2 h(y) > 
(l + x), 



(A.24) 



t m/2-l e -(l+x)(t-m)/2 ^ 



with z = y/T+~x(y — \/rn) and fi = {dh/dy){\frn)/y/l + x = (x + \jm)yfmj 
y/l + x. Since 

e -m/22 e -/i(v / m) e -m/ 2 2 m ( m - 1 )/ 2 \ 

r(m/2)2 m /2 - ( m /2) m /2-i/2 e -m/2^:2 m / 2 ~ 7^ 
by the Stirling formula and x + 1 = 21ogp e , (A. 23) and (A.24) imply 

-mx/2(-i i r \m/2 poo 

P{ X l > mil + x)} < J^+g / e-^-V2 ^ 

A/27rlogp e Jo 

This and (A.22) imply (A.7), since (2^)~ 1 / 2 / °° e - '**-* 2 / 2 dz = e^ 2 / 2 $(-/i) < 
1/2 and (2.12) implies ( p ^f )e _mie / 2 (l + x) m / 2 = e. □ 
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